An End-to-End Multimodal Educational AI System Powered by Gemma 4.
WhyBuddy is an AI-powered educational assistant designed for children under 8 years old that transforms “why” questions into structured learning experiences. It combines natural language reasoning, knowledge retrieval, image generation, text-to-speech, and video synthesis into a unified pipeline.
At its core, WhyBuddy leverages Gemma 4 (E2B instruction-tuned model) for child-friendly reasoning and explanation generation, paired with Stable Diffusion (DreamShaper-7) for visual storytelling and Edge TTS for narration. The system outputs a complete learning artifact: a narrated educational video with step-by-step visual slides.
Reference: Hackathon Writeup
WhyBuddy is built as a modular multimodal pipeline with five core layers:

- Model: `google/gemma-4-e2b-it` (loaded via Kaggle Hub + Hugging Face AutoProcessor)
- Role: Converts raw child questions into structured explanations.
- Generates: `TEACHER_ANSWER` and `IMAGE_PROMPTS` (7–10 visual scene prompts)
- Guardrails: Gemma is guided using a carefully engineered `SYSTEM_PROMPT` that enforces child-safe language, a structured output format, and a strict separation of explanations and visuals.
To improve factual grounding and reduce hallucinations, WhyBuddy integrates a lightweight retrieval system:
- Uses `wikipedia.search()` + `wikipedia.summary()`.
- Activated dynamically via a router module.
- Injects real-time factual context directly into the Gemma prompts.
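
The retrieval step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `fetch_context` and `build_prompt`, the two-sentence limit, and the "Context / Question" prompt layout are all assumptions. Only `wikipedia.search()` and `wikipedia.summary()` come from the description above.

```python
from typing import Callable, List, Optional


def fetch_context(
    question: str,
    search_fn: Optional[Callable[[str], List[str]]] = None,
    summary_fn: Optional[Callable[..., str]] = None,
    max_sentences: int = 2,  # assumed limit, keeps the injected context short
) -> str:
    """Return a short Wikipedia summary for the question, or "" on any failure."""
    if search_fn is None or summary_fn is None:
        import wikipedia  # pip install wikipedia; imported lazily

        search_fn = search_fn or wikipedia.search
        summary_fn = summary_fn or wikipedia.summary
    try:
        titles = search_fn(question)
        if not titles:
            return ""
        return summary_fn(titles[0], sentences=max_sentences)
    except Exception:
        # Disambiguation pages, missing pages, and network errors all
        # degrade to "no context" so the pipeline can continue ungrounded.
        return ""


def build_prompt(question: str, context: str) -> str:
    """Hypothetical prompt layout: prepend retrieved context when available."""
    if not context:
        return question
    return f"Context: {context}\n\nQuestion: {question}"
```

Passing `search_fn` and `summary_fn` explicitly makes the function easy to exercise offline; left at their defaults, the real `wikipedia` package is used.
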
A simple but effective rule-based router classifies user input to dynamically modify the prompt structure before sending it to Gemma:
| Mode | Trigger Condition | Action / Prompt Modification |
|---|---|---|
| Math Mode | Detects mathematical symbols (+, -, *, /, =) | Steers Gemma toward step-by-step mathematical breakdown. |
| Knowledge Mode | Detects interrogative keywords (what, why, how, explain) | Activates the Wikipedia retrieval layer for grounding. |
| Tutor Mode | Default state | Maintains a conversational, interactive learning tone. |
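
A rule-based router along the lines of the table could look like the sketch below. The function name `route`, the mode strings, and the precedence (math checked before knowledge, following the table order) are assumptions made for illustration.

```python
import re

MATH_SYMBOLS = set("+-*/=")
KNOWLEDGE_KEYWORDS = {"what", "why", "how", "explain"}


def route(question: str) -> str:
    """Classify a question as 'math', 'knowledge', or 'tutor' (the default)."""
    # Math Mode: any mathematical symbol appears anywhere in the text.
    if any(ch in MATH_SYMBOLS for ch in question):
        return "math"
    # Knowledge Mode: an interrogative keyword appears as a whole word.
    words = set(re.findall(r"[a-z]+", question.lower()))
    if words & KNOWLEDGE_KEYWORDS:
        return "knowledge"
    # Tutor Mode: nothing matched.
    return "tutor"
```

Note that a literal symbol check also fires on hyphens and slashes in ordinary text (for example "well-known"), so a production router would probably need a stricter math test.
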
- Model: `DreamShaper-7`
- Purpose: Converts text prompts into engaging, cartoon-style learning visuals.
- Pipeline: Each generated `IMAGE_PROMPTS` sentence is automatically prefixed with "Cartoon style of..." and fed into Stable Diffusion.
- Output: 7–10 sequential, slide-based educational images per question to maintain concept-to-visual alignment.
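
The prefix-and-generate step might be sketched as below, using the `diffusers` library. The helper names, the output file naming, and the Hugging Face model id `Lykon/dreamshaper-7` are assumptions; the source only names DreamShaper-7. The 25 inference steps and `torch.float16` follow the optimization notes later in this document.

```python
from typing import List

STYLE_PREFIX = "Cartoon style of "


def build_sd_prompts(image_prompts: List[str]) -> List[str]:
    """Prefix every non-empty scene prompt with the cartoon style hint."""
    return [STYLE_PREFIX + p.strip() for p in image_prompts if p.strip()]


def generate_slides(
    image_prompts: List[str],
    model_id: str = "Lykon/dreamshaper-7",  # assumed model id
    steps: int = 25,
) -> List[str]:
    """Render one PNG per prompt, sequentially, and return the file paths."""
    # Heavy imports are deferred so the prompt helper works without a GPU.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    paths = []
    for i, prompt in enumerate(build_sd_prompts(image_prompts)):
        image = pipe(prompt, num_inference_steps=steps).images[0]
        path = f"slide_{i:02d}.png"  # hypothetical naming scheme
        image.save(path)
        paths.append(path)
    return paths
```
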
- Engine: `edge-tts` (`en-US-GuyNeural`)
- Purpose: Converts the `TEACHER_ANSWER` into natural, human-like narration.
- Sanitization: Includes text cleaning protocols such as emoji removal, unicode normalization, and punctuation filtering to ensure seamless audio playback.
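
One way to implement the sanitization and narration steps is sketched here. The allow-list of punctuation, the function names, and the output filename are assumptions; the voice name comes from the list above, and `edge_tts.Communicate(...).save(...)` is the library's standard entry point.

```python
import re
import unicodedata

# Assumed allow-list: punctuation that helps speech rhythm is kept.
ALLOWED_PUNCTUATION = set(".,!?'-")


def clean_for_tts(text: str) -> str:
    """Normalize unicode, then drop emoji and any punctuation not allow-listed."""
    text = unicodedata.normalize("NFKC", text).replace("\u2019", "'")
    kept = [
        ch for ch in text
        if ch.isalnum() or ch.isspace() or ch in ALLOWED_PUNCTUATION
    ]
    # Collapse the gaps left behind by removed characters.
    return re.sub(r"\s+", " ", "".join(kept)).strip()


async def narrate(
    text: str,
    out_path: str = "narration.mp3",  # hypothetical output path
    voice: str = "en-US-GuyNeural",
) -> str:
    """Synthesize the cleaned text with Edge TTS and return the audio path."""
    import edge_tts  # pip install edge-tts; imported lazily

    await edge_tts.Communicate(clean_for_tts(text), voice).save(out_path)
    return out_path
```

From synchronous code, the narration can be started with `asyncio.run(narrate(teacher_answer))`.
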
WhyBuddy stitches the assets together into a full educational video:
- Images are generated via Stable Diffusion and converted to `ImageClip` elements.
- Each slide is assigned a fixed duration (3–4 seconds per slide).
- Audio generated by Edge TTS is overlaid.
- Final Output: 🎥 `final_{uuid}.mp4`
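
The assembly step could be sketched as follows, assuming the MoviePy 1.x API (`moviepy.editor`, `set_duration`, `set_audio`). The clamping rule in `slide_durations` is one plausible reading of "3–4 seconds per slide" and is an assumption, as are the function names and the 24 fps setting.

```python
import uuid
from typing import List


def slide_durations(
    audio_seconds: float, n_slides: int, lo: float = 3.0, hi: float = 4.0
) -> List[float]:
    """Spread the narration evenly over the slides, clamped to [lo, hi] seconds."""
    if n_slides <= 0:
        raise ValueError("n_slides must be positive")
    per_slide = min(max(audio_seconds / n_slides, lo), hi)
    return [per_slide] * n_slides


def build_video(image_paths: List[str], audio_path: str) -> str:
    """Concatenate the slides, overlay the narration, and export an .mp4."""
    # MoviePy 1.x import path; deferred so the timing helper runs without it.
    from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

    audio = AudioFileClip(audio_path)
    durations = slide_durations(audio.duration, len(image_paths))
    clips = [
        ImageClip(path).set_duration(seconds)
        for path, seconds in zip(image_paths, durations)
    ]
    video = concatenate_videoclips(clips, method="compose").set_audio(audio)
    out_path = f"final_{uuid.uuid4().hex}.mp4"
    video.write_videofile(out_path, fps=24)
    return out_path
```

Newer MoviePy releases changed the import path and renamed these methods (for example `with_duration` and `with_audio`), so the calls above would need adjusting there.
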
Make sure you have Python 3.8+ installed on your system.
- Clone the repository and enter it: `git clone https://github.com/Esabelle11/WhyBuddy.git`, then `cd WhyBuddy`
- Install dependencies (first run only): `pip install -r requirements.txt`
- Run the project with Streamlit: `streamlit run app.py`
- Stop the app: press `Ctrl + C` in the terminal
Gemma 4 serves as the central orchestration engine of WhyBuddy, driving three key elements:
Gemma is constrained to always produce strict, deterministic blocks for predictable downstream parsing:
TEACHER_ANSWER: [Child-friendly explanation text goes here]
IMAGE_PROMPTS: [Prompt 1 | Prompt 2 | Prompt 3...]
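
A parser for this two-block format might look like the sketch below. The function name and the choice to tolerate optional square brackets are assumptions; the marker names and the `|` separator come from the format shown above. When the markers are missing, the fallback treats the whole response as the explanation and returns no image prompts.

```python
import re
from typing import List, Tuple


def parse_gemma_output(raw: str) -> Tuple[str, List[str]]:
    """Split a model response into (teacher_answer, image_prompts)."""
    match = re.search(
        r"TEACHER_ANSWER:\s*(.*?)\s*IMAGE_PROMPTS:\s*(.*)", raw, flags=re.DOTALL
    )
    if match:
        answer, prompts_blob = match.group(1), match.group(2)
    else:
        # Fallback: split on whichever marker is present, if any.
        answer, _, prompts_blob = raw.partition("IMAGE_PROMPTS:")
        answer = answer.replace("TEACHER_ANSWER:", "")
    answer = answer.strip().strip("[]").strip()
    prompts = [p.strip().strip("[]").strip() for p in prompts_blob.split("|")]
    return answer, [p for p in prompts if p]
```
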
Through strict system prompt engineering, Gemma is forced to:
- Simplify vocabulary drastically.
- Utilize relatable analogies (e.g., toys, food, school, animals).
- Avoid technical jargon while remaining factually accurate.
- Maintain an encouraging, safe, and warm tone.
WhyBuddy dynamically tracks follow-up questions using an internal state counter (chat_state["depth"]). This allows Gemma to scale its explanation depth fluidly:
- Level 0: Normal, comprehensive child-friendly explanation.
- Level 1: Simpler language with shorter sentence structures.
- Level 2: Centered heavily around a single, real-world analogy.
- Level 3: Ultra-simple, high-engagement toddler explanation.
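
The depth levels can be modeled with a small lookup keyed by `chat_state["depth"]`, as in this sketch. The instruction wording, the helper names, and the rule that each follow-up raises the depth by one (capped at Level 3) are assumptions.

```python
from typing import Dict

# One instruction per level, paraphrasing the list above.
DEPTH_INSTRUCTIONS = [
    "Give a normal, comprehensive child-friendly explanation.",
    "Use simpler language and shorter sentences.",
    "Explain using a single real-world analogy.",
    "Give an ultra-simple, highly engaging explanation for a toddler.",
]
MAX_DEPTH = len(DEPTH_INSTRUCTIONS) - 1


def depth_instruction(chat_state: Dict[str, int]) -> str:
    """Return the prompt instruction for the current depth level."""
    depth = min(max(chat_state.get("depth", 0), 0), MAX_DEPTH)
    return DEPTH_INSTRUCTIONS[depth]


def register_follow_up(chat_state: Dict[str, int]) -> None:
    """Assumed policy: each follow-up question moves one level simpler."""
    chat_state["depth"] = min(chat_state.get("depth", 0) + 1, MAX_DEPTH)
```
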
- Model Loading Complexity (Gemma + SD)
  - Problem: Multi-gigabyte models caused initialization timeouts and memory failures in constrained Kaggle notebook environments.
  - Solution: Implemented explicit Kaggle Hub downloading (`kagglehub.model_download`) paired with conditional fallback error handling and automated GPU/VRAM capability detection.
- Structured Output Parsing
  - Problem: LLMs can occasionally output unstructured or inconsistent text, breaking regex patterns in downstream multimedia layers.
  - Solution: Enforced strict JSON/Marker formats via the `SYSTEM_PROMPT` combined with a robust custom fallback parser that splits content cleanly on pre-defined delimiter boundaries.
- Multimedia Synchronization
  - Problem: Aligning dynamic audio narration length with static generated visual slides.
  - Solution: Applied a fixed per-slide timing strategy (3–4 seconds per image scene) synchronized against the total narrative length, cleanly concatenated and exported via MoviePy.
- Performance Bottlenecks
  - Problem: Stable Diffusion inference scales poorly and slows down user experience loops.
  - Solution: Optimized generation by reducing inference to 25 steps, utilizing mixed-precision GPU acceleration (`torch.float16`), and maintaining a sequential, controlled execution pipeline.
- Child Safety & Content Control
  - Problem: Risks of generating complex, scary, or unsafe edge-case text/images.
  - Solution: Embedded a rigid safety verification layer directly within the `SYSTEM_PROMPT` alongside automated emoji, unicode, and profanity sanitization filters.
- Wikipedia Retrieval: Drastically reduces hallucination risk and anchors abstract definitions in verified reality.
- Gemma 4: Exceptional instruction-following capabilities and structured parsing reliability compared to alternative lightweight open-weights models.
- Stable Diffusion (DreamShaper): Successfully translates complex, abstract ideas into highly engaging visual learning cues optimized for early childhood psychology.
- Edge TTS: Extremely lightweight, fast API requiring zero local resource overhead while delivering human-grade vocal inflection.
- MoviePy: Provides a reliable programmatic multimedia stitching framework without requiring external heavy video rendering engines.
WhyBuddy produces a completely autonomous, AI-generated learning experience containing:
- 📝 Explanation Text tailored to the child's exact age and curiosity level (Gemma 4).
- 🎨 Visual Storybook Slides matching the narrative context (Stable Diffusion).
- 🔊 Voice Narration with clear, friendly pronunciation (Edge TTS).
- 🎬 An Educational Video File ready for playback (.mp4 via MoviePy).
WhyBuddy demonstrates how Gemma 4 can serve as the reliable reasoning core of a complex, multimodal educational system. By combining LLM reasoning, retrieval augmentation, image generation, speech synthesis, and video composition, WhyBuddy showcases full-stack AI system engineering for real-world interactive learning.
