An open-source framework for building and training music agents.
talkenv provides:
- Dataset loaders for large-scale music corpora (audio, images, lyrics, metadata, attributes) hosted on Hugging Face.
- Feature extractors for:
  - Audio (CLAP, MusicFM)
  - Images (SigLIP2 cover-art embeddings)
  - Text (Qwen3 embedding models over metadata, attributes and lyrics)
  - Collaborative filtering and playlists (BPR, item2vec)
- Residual vector quantization (RVQ) training and inference utilities to discretize continuous embeddings into compact semantic codes.
Requirements:
- Python (>= 3.10)
- PyTorch (>= 2.7.1) with CUDA (recommended for all heavy jobs)
- Sufficient disk space for:
  - Raw Spotify-like audio and image data
  - Hugging Face datasets (see below)
  - Extracted embedding `.npy` files
From the repository root:

```bash
cd talkenv
pip install -e .
```

Most modules assume the presence of Hugging Face datasets and local file trees modeled after Spotify-like IDs. Typical assumptions:
- Local raw data layout:
  - Audio: `/mnt/hdd/datasets/spotify/src/audio/{track_id[0]}/{track_id[1]}/{track_id}.mp3`
  - Images: `/workspace/dataset/spotify/images/{track_id[0]}/{track_id[1]}/{track_id}.jpg`
- Local embedding layout (examples):
  - Audio (CLAP): `/workspace/dataset/spotify/embs/audio/laion_clap/...`
  - Images (SigLIP2): `/workspace/dataset/spotify/embs/image/siglip2/...`
  - Text (Qwen3) for `metadata`, `lyrics`, `attributes`: `/workspace/dataset/spotify/embs/text/{dataset}/qwen3_small/...`
  - Collaborative filtering and playlists: `/workspace/dataset/spotify/embs/cf_*` and `/workspace/dataset/spotify/embs/playlist/item2vec/...`
You will likely want to adapt these root paths to your own environment before running the scripts.
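For reference, the sharded ID layout above can be reproduced with a tiny helper. A minimal sketch (the function name, root argument and example track ID are illustrative, not part of the package):

```python
from pathlib import Path

def resolve_track_path(root: str, track_id: str, ext: str = "mp3") -> Path:
    """Map a Spotify-like track ID to its sharded location:
    <root>/<first char>/<second char>/<track_id>.<ext>"""
    return Path(root) / track_id[0] / track_id[1] / f"{track_id}.{ext}"

# Example with the raw-audio root from the layout above and a made-up track ID.
audio_path = resolve_track_path(
    "/mnt/hdd/datasets/spotify/src/audio", "7xQAfvXzm3AkraOtGPWIZg"
)
```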
The Python package lives under `talkenv/` and is organized into three main sub-packages:
- `talkenv.dataset`: dataset loaders that wrap Hugging Face datasets and return PyTorch-style datasets.
- `talkenv.extractor`: feature extractors for audio, images, text and collaborative filtering.
- `talkenv.quantizer`: training and inference utilities for residual vector quantization (RVQ).
Each module and class has a short docstring describing its main arguments, inputs and outputs.
- `audio.py`: `AudioDataset` iterates over track IDs and resolves them to local audio file paths.
- `image.py`: `ImageDataset` iterates over tracks and resolves them to local cover-art image paths.
- `lyrics.py`: `LyricsDataset` joins the metadata table with a pseudo-lyrics dataset and produces simple prompts of the form `"music lyrics: ..."` per track.
- `metadata.py`: `MetadataDataset` turns track-level metadata (title, artists, album) into text prompts.
- `attributes.py`: `AttributesDataset` combines tag lists with pseudo annotations (caption, tempo, key, chord) into a structured attribute string.
All of these datasets are compatible with `torch.utils.data.DataLoader` and typically return `(track_id, <text or path>)` tuples.
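As a quick illustration of how these datasets are typically consumed (a minimal sketch; the import path follows the module names above, and the dataset is constructed with defaults here, while the real class may require paths or a split argument):

```python
from torch.utils.data import DataLoader

from talkenv.dataset.metadata import MetadataDataset  # module name per the list above

dataset = MetadataDataset()  # yields (track_id, text_prompt) pairs
loader = DataLoader(dataset, batch_size=256, num_workers=4)

for track_ids, texts in loader:
    # With the default collate function, each is a list of strings of length batch_size.
    pass
```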
Audio:
- `extractor/audio/clap.py`: wraps a LAION-CLAP model and writes 512-D audio embeddings to disk, one `.npy` file per track.
- `extractor/audio/musicfm.py`: uses a MusicFM / TTMR-like model from `mus_embs` to extract audio embeddings over batches of audio.
Images:
- `extractor/image/siglip2.py`: extracts SigLIP2 image embeddings (e.g. 768-D) for cover-art images and saves them per track.
Text:
- `extractor/text/qwen3_embedding.py`: script for extracting Qwen3 embeddings for a single dataset variant (older/experimental).
- `extractor/text/nv_embed_v2.py`: main script for extracting Qwen3 embeddings over the `metadata`, `attributes`, or `lyrics` datasets and saving them as `.npy` files.
Collaborative filtering:
- `extractor/cf/bpr.py`: trains a BPR model on listening history and exports user/item embeddings.
- `extractor/cf/cornac_dataset.py`: local copy of Cornac dataset utilities used by the BPR training script.
- `extractor/cf/item2vec.py`: trains an item2vec model on listening history and writes per-track embeddings.
- `extractor/playlist/item2vec.py`: trains an item2vec model on playlist sessions and writes per-track embeddings.
- `quantizer/rvq_train.py`: trains a residual vector quantizer (RVQ) on precomputed embeddings for a single modality (audio, image, text, CF), logs metrics, and saves the best checkpoint plus a summary JSON.
- `quantizer/rvq_infer.py`: loads a trained RVQ model and converts embeddings into discrete code indices, writing the codes to disk.
- Download or mount audio files and cover-art images under the expected directory trees.
- Make sure the Hugging Face datasets referenced in the code (see above) are available (they will be cached automatically by `datasets.load_dataset`).
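Datasets are fetched and cached on first use; a minimal sketch (the repository ID below is a placeholder, substitute the dataset IDs referenced in the code):

```python
from datasets import load_dataset

# Placeholder repo ID; use the Hugging Face datasets the loaders actually reference.
ds = load_dataset("your-org/your-music-dataset", split="train")
print(ds[0])
```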
Examples (all run from the repository root):
CLAP audio embeddings:

```bash
python -m talkenv.extractor.audio.clap
```

This script:

- Loads a CLAP model checkpoint from a configured path.
- Iterates over `AudioDataset`, which yields `(track_id, audio_path)` pairs.
- For each existing audio file, computes a 512-D embedding and saves it to `/workspace/dataset/spotify/embs/audio/laion_clap/{track_id[0]}/{track_id[1]}/{track_id}.npy`.
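To sanity-check a single file outside the script, the underlying LAION-CLAP call looks roughly like this (a minimal sketch assuming the `laion_clap` package; the checkpoint choice and file paths are placeholders and may not match the script's configuration):

```python
import laion_clap
import numpy as np

# Load a CLAP model; the script loads its checkpoint from a configured path instead.
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads/loads the default pretrained checkpoint

# Embed one audio file; returns an array of shape (1, 512).
emb = model.get_audio_embedding_from_filelist(
    x=["/path/to/some_track.mp3"], use_tensor=False
)
np.save("some_track.npy", emb[0].astype(np.float32))
```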
SigLIP2 image embeddings:

```bash
python -m talkenv.extractor.image.siglip2
```

This script:

- Loads the `google/siglip2-base-patch16-224` model and its processor.
- Iterates over `ImageDataset`, converting each cover image into a 768-D embedding.
- Saves `.npy` files under `/workspace/dataset/spotify/embs/image/siglip2/{track_id[0]}/{track_id[1]}/{track_id}.npy`.
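Per image, the extraction is essentially the standard `transformers` image-feature call; a minimal sketch (the exact preprocessing and pooling in the script may differ):

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

image = Image.open("/path/to/cover.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = model.get_image_features(**inputs)  # pooled embedding, 768-D for the base model

np.save("cover.npy", features[0].numpy().astype(np.float32))
```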
Qwen3 text embeddings:

```bash
python -m talkenv.extractor.text.nv_embed_v2 \
    --dataset metadata \
    --emb_path /workspace/dataset/spotify/embs/text \
    --emb_name qwen3_small \
    --device cuda
```

This script:

- Loads Qwen3-Embedding-0.6B and its tokenizer.
- Iterates over `MetadataDataset`, `AttributesDataset` or `LyricsDataset` (depending on `--dataset`).
- For each non-empty text, runs the model and saves a single pooled embedding vector per track.
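All three extractors write one `.npy` vector per track into the sharded layout described earlier; reading one back is straightforward (the root is one of the example paths above, the track ID is made up):

```python
from pathlib import Path
import numpy as np

emb_root = Path("/workspace/dataset/spotify/embs/audio/laion_clap")
track_id = "7xQAfvXzm3AkraOtGPWIZg"  # made-up example ID

emb = np.load(emb_root / track_id[0] / track_id[1] / f"{track_id}.npy")
print(emb.shape)  # e.g. (512,) for CLAP audio embeddings
```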
Once you have continuous embeddings, you can train a residual vector quantizer:
```bash
python -m talkenv.quantizer.rvq_train \
    --modality audio \
    --num_quantizers 4 \
    --codebook_size 64 \
    --batch_size 65536 \
    --epochs 20 \
    --save_dir /workspace/talkplay/exp/rvq_audio
```

This will:
- Load the specified modality's embeddings via `EmbeddingDataset`.
- Train a `ResidualVQ` model from `vector-quantize-pytorch`.
- Periodically evaluate codebook utilization and commit loss on a test split.
- Save:
  - Model config: `config.yaml`
  - Best checkpoint: `best.pt`
  - Final metrics: `final_metrics.json`
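Under the hood this amounts to fitting a `ResidualVQ` module on batches of precomputed embeddings. A minimal sketch using the hyperparameters from the command above (random data stands in for real embeddings; the actual script adds optimization, evaluation and checkpointing):

```python
import torch
from vector_quantize_pytorch import ResidualVQ

# 4 quantizers with 64-entry codebooks over 512-D embeddings (dims are illustrative).
rvq = ResidualVQ(dim=512, num_quantizers=4, codebook_size=64)

# A batch of embeddings shaped (batch, seq, dim); seq is 1 when there is one vector per track.
x = torch.randn(65536, 1, 512)

# The forward pass quantizes the batch and, with the default EMA codebooks, updates them.
quantized, indices, commit_loss = rvq(x)
print(indices.shape)  # (65536, 1, 4): one code index per quantizer level
print(commit_loss)    # per-quantizer commitment loss, useful as a training metric
```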
After training, you can convert embeddings into discrete code indices:
```bash
python -m talkenv.quantizer.rvq_infer \
    --modality audio \
    --num_quantizers 4 \
    --codebook_size 64 \
    --save_dir /workspace/talkplay/exp/rvq_audio \
    --ckpt /workspace/talkplay/exp/rvq_audio/best.pt
```

This script:

- Loads the trained `ResidualVQ` checkpoint.
- Iterates over the specified dataset splits and modalities.
- Writes discrete code indices for each track into a configurable subdirectory (e.g. `codes/`).
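Per track, inference reduces to one forward pass through the trained quantizer. A minimal sketch (paths are illustrative, and `best.pt` is assumed to hold the module's state dict directly, which may not match the script's checkpoint format):

```python
import numpy as np
import torch
from vector_quantize_pytorch import ResidualVQ

# Re-create the quantizer with the same hyperparameters used for training.
rvq = ResidualVQ(dim=512, num_quantizers=4, codebook_size=64)
state = torch.load("/workspace/talkplay/exp/rvq_audio/best.pt", map_location="cpu")
rvq.load_state_dict(state)  # assumes the checkpoint is a plain state dict
rvq.eval()

# Quantize one precomputed embedding into 4 discrete code indices.
emb = torch.from_numpy(np.load("/path/to/track_embedding.npy")).float().view(1, 1, -1)
with torch.no_grad():
    _, indices, _ = rvq(emb)

np.save("codes.npy", indices.squeeze().numpy())  # e.g. 4 code IDs in [0, 63]
```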