An open-source framework for building and training music agents.
talkenv provides:
- Dataset loaders for large-scale music corpora (audio, images, lyrics, metadata, attributes) hosted on Hugging Face.
- Feature extractors for:
  - Audio (CLAP, MusicFM)
  - Images (SigLIP2 cover-art embeddings)
  - Text (Qwen3 embedding models over metadata, attributes and lyrics)
  - Collaborative filtering and playlists (BPR, item2vec)
- Residual vector quantization (RVQ) training and inference utilities to discretize continuous embeddings into compact semantic codes.
Requirements:
- Python (>= 3.10)
- PyTorch (>= 2.7.1) with CUDA (recommended for all heavy jobs)
- Sufficient disk space for:
  - Raw Spotify-like audio and image data
  - Hugging Face datasets (see below)
  - Extracted embedding `.npy` files
From the repository root:

```bash
cd talkenv
pip install -e .
```

Most modules assume the presence of Hugging Face datasets and local file trees modeled after Spotify-like IDs. Typical assumptions:
- Local raw data layout:
  - Audio: `/mnt/hdd/datasets/spotify/src/audio/{track_id[0]}/{track_id[1]}/{track_id}.mp3`
  - Images: `/workspace/dataset/spotify/images/{track_id[0]}/{track_id[1]}/{track_id}.jpg`
- Local embedding layout (examples):
  - Audio (CLAP): `/workspace/dataset/spotify/embs/audio/laion_clap/...`
  - Images (SigLIP2): `/workspace/dataset/spotify/embs/image/siglip2/...`
  - Text (Qwen3) for `metadata`, `lyrics`, `attributes`: `/workspace/dataset/spotify/embs/text/{dataset}/qwen3_small/...`
  - Collaborative filtering and playlists: `/workspace/dataset/spotify/embs/cf_*` and `/workspace/dataset/spotify/embs/playlist/item2vec/...`
You will likely want to adapt these root paths to your own environment before running the scripts.
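For reference, the sharded ID layout above can be reproduced with a tiny helper. A minimal sketch (the function name, root argument and example track ID are illustrative, not part of the package):

```python
from pathlib import Path

def resolve_track_path(root: str, track_id: str, ext: str = "mp3") -> Path:
    """Map a Spotify-like track ID to its sharded location:
    <root>/<first char>/<second char>/<track_id>.<ext>"""
    return Path(root) / track_id[0] / track_id[1] / f"{track_id}.{ext}"

# Example with the raw-audio root from the layout above and a made-up track ID.
audio_path = resolve_track_path(
    "/mnt/hdd/datasets/spotify/src/audio", "7xQAfvXzm3AkraOtGPWIZg"
)
```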
The Python package lives under `talkenv/` and is organized into three main sub-packages:
- `talkenv.dataset`: dataset loaders that wrap Hugging Face datasets and return PyTorch-style datasets.
- `talkenv.extractor`: feature extractors for audio, images, text and collaborative filtering.
- `talkenv.quantizer`: training and inference utilities for residual vector quantization (RVQ).
Each module and class has a short docstring describing its main arguments, inputs and outputs.
- `audio.py`: `AudioDataset` iterates over track IDs and resolves them to local audio file paths.
- `image.py`: `ImageDataset` iterates over tracks and resolves them to local cover-art image paths.
- `lyrics.py`: `LyricsDataset` joins the metadata table with a pseudo-lyrics dataset and produces simple prompts of the form `"music lyrics: ..."` per track.
- `metadata.py`: `MetadataDataset` turns track-level metadata (title, artists, album) into text prompts.
- `attributes.py`: `AttributesDataset` combines tag lists with pseudo annotations (caption, tempo, key, chord) into a structured attribute string.
All of these datasets are compatible with `torch.utils.data.DataLoader` and typically return `(track_id, <text or path>)` tuples.
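As a quick illustration of how these datasets are typically consumed (a minimal sketch; the import path follows the module names above, and the dataset is constructed with defaults here, while the real class may require paths or a split argument):

```python
from torch.utils.data import DataLoader

from talkenv.dataset.metadata import MetadataDataset  # module name per the list above

dataset = MetadataDataset()  # yields (track_id, text_prompt) pairs
loader = DataLoader(dataset, batch_size=256, num_workers=4)

for track_ids, texts in loader:
    # With the default collate function, each is a list of strings of length batch_size.
    pass
```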
Audio:
- `extractor/audio/clap.py`: wraps a LAION-CLAP model and writes 512-D audio embeddings to disk, one `.npy` file per track.
- `extractor/audio/musicfm.py`: uses a MusicFM / TTMR-like model from `mus_embs` to extract audio embeddings over batches of audio.
Images:
- `extractor/image/siglip2.py`: extracts SigLIP2 image embeddings (e.g. 768-D) for cover-art images and saves them per track.
Text:
- `extractor/text/qwen3_embedding.py`: script for extracting Qwen3 embeddings for a single dataset variant (older/experimental).
- `extractor/text/nv_embed_v2.py`: main script for extracting Qwen3 embeddings over the `metadata`, `attributes`, or `lyrics` datasets and saving them as `.npy` files.
Collaborative filtering:
- `extractor/cf/bpr.py`: trains a BPR model on listening history and exports user/item embeddings.
- `extractor/cf/cornac_dataset.py`: local copy of Cornac dataset utilities used by the BPR training script.
- `extractor/cf/item2vec.py`: trains an item2vec model on listening history and writes per-track embeddings.
- `extractor/playlist/item2vec.py`: trains an item2vec model on playlist sessions and writes per-track embeddings.
- `quantizer/rvq_train.py`: trains a residual vector quantizer (RVQ) on precomputed embeddings for a single modality (audio, image, text, CF), logs metrics, and saves the best checkpoint plus a summary JSON.
- `quantizer/rvq_infer.py`: loads a trained RVQ model and converts embeddings into discrete code indices, writing the codes to disk.
- Download or mount audio files and cover-art images under the expected directory trees.
- Make sure the Hugging Face datasets referenced in the code (see above) are available (they will be cached automatically by `datasets.load_dataset`).
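Datasets are fetched and cached on first use; a minimal sketch (the repository ID below is a placeholder, substitute the dataset IDs referenced in the code):

```python
from datasets import load_dataset

# Placeholder repo ID; use the Hugging Face datasets the loaders actually reference.
ds = load_dataset("your-org/your-music-dataset", split="train")
print(ds[0])
```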
Examples (all run from the repository root):
CLAP audio embeddings:

```bash
python -m talkenv.extractor.audio.clap
```

This script:

- Loads a CLAP model checkpoint from a configured path.
- Iterates over `AudioDataset`, which yields `(track_id, audio_path)` pairs.
- For each existing audio file, computes a 512-D embedding and saves it to `/workspace/dataset/spotify/embs/audio/laion_clap/{track_id[0]}/{track_id[1]}/{track_id}.npy`.
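To sanity-check a single file outside the script, the underlying LAION-CLAP call looks roughly like this (a minimal sketch assuming the `laion_clap` package; the checkpoint choice and file paths are placeholders and may not match the script's configuration):

```python
import laion_clap
import numpy as np

# Load a CLAP model; the script loads its checkpoint from a configured path instead.
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads/loads the default pretrained checkpoint

# Embed one audio file; returns an array of shape (1, 512).
emb = model.get_audio_embedding_from_filelist(
    x=["/path/to/some_track.mp3"], use_tensor=False
)
np.save("some_track.npy", emb[0].astype(np.float32))
```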
SigLIP2 image embeddings:

```bash
python -m talkenv.extractor.image.siglip2
```

This script:

- Loads the `google/siglip2-base-patch16-224` model and its processor.
- Iterates over `ImageDataset`, converting each cover image into a 768-D embedding.
- Saves `.npy` files under `/workspace/dataset/spotify/embs/image/siglip2/{track_id[0]}/{track_id[1]}/{track_id}.npy`.
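Per image, the extraction is essentially the standard `transformers` image-feature call; a minimal sketch (the exact preprocessing and pooling in the script may differ):

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

image = Image.open("/path/to/cover.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = model.get_image_features(**inputs)  # pooled embedding, 768-D for the base model

np.save("cover.npy", features[0].numpy().astype(np.float32))
```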
Qwen3 text embeddings:

```bash
python -m talkenv.extractor.text.nv_embed_v2 \
    --dataset metadata \
    --emb_path /workspace/dataset/spotify/embs/text \
    --emb_name qwen3_small \
    --device cuda
```

This script:

- Loads Qwen3-Embedding-0.6B and its tokenizer.
- Iterates over `MetadataDataset`, `AttributesDataset` or `LyricsDataset` (depending on `--dataset`).
- For each non-empty text, runs the model and saves a single pooled embedding vector per track.
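All three extractors write one `.npy` vector per track into the sharded layout described earlier; reading one back is straightforward (the root is one of the example paths above, the track ID is made up):

```python
from pathlib import Path
import numpy as np

emb_root = Path("/workspace/dataset/spotify/embs/audio/laion_clap")
track_id = "7xQAfvXzm3AkraOtGPWIZg"  # made-up example ID

emb = np.load(emb_root / track_id[0] / track_id[1] / f"{track_id}.npy")
print(emb.shape)  # e.g. (512,) for CLAP audio embeddings
```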
Once you have continuous embeddings, you can train a residual vector quantizer:
```bash
python -m talkenv.quantizer.rvq_train \
    --modality audio \
    --num_quantizers 4 \
    --codebook_size 64 \
    --batch_size 65536 \
    --epochs 20 \
    --save_dir /workspace/talkplay/exp/rvq_audio
```

This will:
- Load the specified modality's embeddings via `EmbeddingDataset`.
- Train a `ResidualVQ` model from `vector-quantize-pytorch`.
- Periodically evaluate codebook utilization and commit loss on a test split.
- Save:
  - Model config: `config.yaml`
  - Best checkpoint: `best.pt`
  - Final metrics: `final_metrics.json`
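Under the hood this amounts to fitting a `ResidualVQ` module on batches of precomputed embeddings. A minimal sketch using the hyperparameters from the command above (random data stands in for real embeddings; the actual script adds optimization, evaluation and checkpointing):

```python
import torch
from vector_quantize_pytorch import ResidualVQ

# 4 quantizers with 64-entry codebooks over 512-D embeddings (dims are illustrative).
rvq = ResidualVQ(dim=512, num_quantizers=4, codebook_size=64)

# A batch of embeddings shaped (batch, seq, dim); seq is 1 when there is one vector per track.
x = torch.randn(65536, 1, 512)

# The forward pass quantizes the batch and, with the default EMA codebooks, updates them.
quantized, indices, commit_loss = rvq(x)
print(indices.shape)  # (65536, 1, 4): one code index per quantizer level
print(commit_loss)    # per-quantizer commitment loss, useful as a training metric
```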
After training, you can convert embeddings into discrete code indices:
```bash
python -m talkenv.quantizer.rvq_infer \
    --modality audio \
    --num_quantizers 4 \
    --codebook_size 64 \
    --save_dir /workspace/talkplay/exp/rvq_audio \
    --ckpt /workspace/talkplay/exp/rvq_audio/best.pt
```

This script:

- Loads the trained `ResidualVQ` checkpoint.
- Iterates over the specified dataset splits and modalities.
- Writes discrete code indices for each track into a configurable subdirectory (e.g. `codes/`).
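Per track, inference reduces to one forward pass through the trained quantizer. A minimal sketch (paths are illustrative, and `best.pt` is assumed to hold the module's state dict directly, which may not match the script's checkpoint format):

```python
import numpy as np
import torch
from vector_quantize_pytorch import ResidualVQ

# Re-create the quantizer with the same hyperparameters used for training.
rvq = ResidualVQ(dim=512, num_quantizers=4, codebook_size=64)
state = torch.load("/workspace/talkplay/exp/rvq_audio/best.pt", map_location="cpu")
rvq.load_state_dict(state)  # assumes the checkpoint is a plain state dict
rvq.eval()

# Quantize one precomputed embedding into 4 discrete code indices.
emb = torch.from_numpy(np.load("/path/to/track_embedding.npy")).float().view(1, 1, -1)
with torch.no_grad():
    _, indices, _ = rvq(emb)

np.save("codes.npy", indices.squeeze().numpy())  # e.g. 4 code IDs in [0, 63]
```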