Multimodal

gitpavleenbali edited this page Feb 17, 2026 · 2 revisions

Process images, audio, and video with AI agents.

See Multimodal-Module for full documentation.

Quick Start

from pyai.multimodal import ImageContent, AudioContent

# Image analysis
image = ImageContent.from_file("photo.jpg")
description = image.describe()

# Audio transcription
audio = AudioContent.from_file("recording.mp3")
text = audio.transcribe()

Features

  • Image understanding and analysis
  • Audio transcription
  • Video frame analysis
  • Multi-modal conversations
  • Format conversion

Supported Formats

Type     Formats
Image    PNG, JPG, GIF, WebP
Audio    MP3, WAV, M4A, FLAC
Video    MP4, MOV, AVI
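
The format table above can be turned into a small pre-flight check that decides which kind of content a file holds before it is loaded. The following is a minimal sketch using only the standard library. The names `SUPPORTED_FORMATS` and `detect_media_type` are hypothetical and not part of `pyai`, and accepting `.jpeg` as an alias for JPG is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the Supported Formats table.
# ".jpeg" is added as an assumption (common alias for JPG).
SUPPORTED_FORMATS = {
    "image": {".png", ".jpg", ".jpeg", ".gif", ".webp"},
    "audio": {".mp3", ".wav", ".m4a", ".flac"},
    "video": {".mp4", ".mov", ".avi"},
}


def detect_media_type(path: str) -> Optional[str]:
    """Return "image", "audio", or "video" for a supported extension, else None."""
    # Compare case-insensitively so "CLIP.MOV" and "clip.mov" behave the same.
    suffix = Path(path).suffix.lower()
    for media_type, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return media_type
    return None
```

A caller could use the result to choose between `ImageContent.from_file` and `AudioContent.from_file`, and to reject unsupported files early. The check looks only at the file extension, not at the file's contents.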
