talkenv

An open-source framework for building and training music agents, providing a vector DB and semantic IDs for conversational music recommendation.

talkenv provides:

  • Dataset loaders for large-scale music corpora (audio, images, lyrics, metadata, attributes) hosted on Hugging Face.
  • Feature extractors for:
    • Audio (CLAP, MusicFM)
    • Images (SigLIP2 cover-art embeddings)
    • Text (Qwen3 embedding models over metadata, attributes and lyrics)
    • Collaborative filtering and playlists (BPR, item2vec)
  • Residual vector quantization (RVQ) training and inference utilities to discretize continuous embeddings into compact semantic codes.

Installation

Requirements

  • Python (>= 3.10)
  • PyTorch (>= 2.7.1) with CUDA (recommended for all heavy jobs)
  • Sufficient disk space for:
    • Raw Spotify-like audio and image data
    • Hugging Face datasets (see below)
    • Extracted embedding .npy files

Install from source

From the repository root:

cd talkenv
pip install -e .

Data prerequisites

Most modules assume that the referenced Hugging Face datasets are available and that local file trees are sharded by Spotify-like track IDs. Typical assumptions:

  • Local raw data layout:
    • Audio: /mnt/hdd/datasets/spotify/src/audio/{track_id[0]}/{track_id[1]}/{track_id}.mp3
    • Images: /workspace/dataset/spotify/images/{track_id[0]}/{track_id[1]}/{track_id}.jpg
  • Local embedding layout (examples):
    • Audio (CLAP): /workspace/dataset/spotify/embs/audio/laion_clap/...
    • Images (SigLIP2): /workspace/dataset/spotify/embs/image/siglip2/...
    • Text (Qwen3) for metadata, lyrics, attributes: /workspace/dataset/spotify/embs/text/{dataset}/qwen3_small/...
    • Collaborative filtering and playlists: /workspace/dataset/spotify/embs/cf_* and /workspace/dataset/spotify/embs/playlist/item2vec/...

You will likely want to adapt these root paths to your own environment before running the scripts.
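
The two-level sharding by the first two characters of a track ID can be expressed as a small helper. The function below is a hypothetical illustration, not part of the package:

from pathlib import Path

def shard_path(root: str, track_id: str, ext: str) -> Path:
    # e.g. shard_path("/mnt/hdd/datasets/spotify/src/audio", "7abc123", "mp3")
    #      -> /mnt/hdd/datasets/spotify/src/audio/7/a/7abc123.mp3
    return Path(root) / track_id[0] / track_id[1] / f"{track_id}.{ext}"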


Package overview

The Python package lives under talkenv/ and is organized into three main sub-packages:

  • talkenv.dataset: dataset loaders that wrap Hugging Face datasets and return PyTorch-style datasets.
  • talkenv.extractor: feature extractors for audio, images, text and collaborative filtering.
  • talkenv.quantizer: training and inference utilities for residual vector quantization (RVQ).

Each module and class has a short docstring describing its main arguments, inputs and outputs.

talkenv.dataset

  • audio.py
    • AudioDataset: iterates over track IDs and resolves them to local audio file paths.
  • image.py
    • ImageDataset: iterates over tracks and resolves them to local cover-art image paths.
  • lyrics.py
    • LyricsDataset: joins the metadata table with a pseudo-lyrics dataset and produces simple prompts of the form "music lyrics: ..." per track.
  • metadata.py
    • MetadataDataset: turns track-level metadata (title, artists, album) into text prompts.
  • attributes.py
    • AttributesDataset: combines tag lists with pseudo annotations (caption, tempo, key, chord) into a structured attribute string.

All of these datasets are compatible with torch.utils.data.DataLoader and typically return (track_id, <text or path>) tuples.
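
A minimal usage sketch (constructor arguments are assumptions; check each class's docstring):

from torch.utils.data import DataLoader
from talkenv.dataset.metadata import MetadataDataset

dataset = MetadataDataset()  # constructor arguments assumed; see the class docstring
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for track_ids, texts in loader:  # default collation yields a list of IDs and a list of texts
    ...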

talkenv.extractor

Audio:

  • extractor/audio/clap.py: wraps a LAION-CLAP model and writes 512‑D audio embeddings to disk, one .npy file per track.
  • extractor/audio/musicfm.py: uses a MusicFM / TTMR-style model from mus_embs to extract audio embeddings in batches.

Images:

  • extractor/image/siglip2.py: extracts SigLIP2 image embeddings (e.g. 768‑D) for cover art images and saves them per track.

Text:

  • extractor/text/qwen3_embedding.py: script for extracting Qwen3 embeddings for a single dataset variant (older/experimental).
  • extractor/text/nv_embed_v2.py: main script for extracting Qwen3 embeddings over metadata, attributes, or lyrics datasets and saving them as .npy files.

Collaborative filtering:

  • extractor/cf/bpr.py: trains a BPR model on listening history and exports user/item embeddings.
  • extractor/cf/cornac_dataset.py: local copy of Cornac dataset utilities used by the BPR training script.
  • extractor/cf/item2vec.py: trains an item2vec model on listening history and writes per-track embeddings.
  • extractor/playlist/item2vec.py: trains an item2vec model on playlist sessions and writes per-track embeddings.
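
Conceptually, item2vec treats each playlist or listening session as a "sentence" of track IDs and trains a skip-gram word2vec model over those sequences. A minimal sketch of the idea with gensim (the repo's implementation may differ):

from gensim.models import Word2Vec

# Each playlist is a sequence of track-ID "tokens" (toy data for illustration).
playlists = [
    ["trk_a", "trk_b", "trk_c"],
    ["trk_b", "trk_d"],
]

model = Word2Vec(
    sentences=playlists,
    vector_size=128,  # embedding dimensionality
    window=5,         # context window within a playlist
    min_count=1,
    sg=1,             # skip-gram, as in the original item2vec formulation
)

vec = model.wv["trk_b"]  # per-track embedding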

talkenv.quantizer

  • quantizer/rvq_train.py: trains a residual vector quantizer (RVQ) on precomputed embeddings for a single modality (audio, image, text, CF), logs metrics, and saves the best checkpoint plus summary JSON.
  • quantizer/rvq_infer.py: loads a trained RVQ model and converts embeddings into discrete code indices, writing the codes to disk.

Typical workflows

1. Prepare local datasets

  1. Download or mount audio files and cover-art images under the expected directory trees.
  2. Make sure the Hugging Face datasets referenced in the code (see above) are available (they will be cached automatically by datasets.load_dataset).

2. Extract embeddings

Examples (all run from the repository root):

  • CLAP audio embeddings

python -m talkenv.extractor.audio.clap

This script:

  • Loads a CLAP model checkpoint from a configured path.

  • Iterates over AudioDataset, which yields (track_id, audio_path) pairs.

  • For each existing audio file, computes a 512‑D embedding and saves it to: /workspace/dataset/spotify/embs/audio/laion_clap/{track_id[0]}/{track_id[1]}/{track_id}.npy.
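
Downstream code can load the saved vectors directly with NumPy, for example (track ID hypothetical):

import numpy as np

track_id = "7abc123"  # hypothetical track ID
emb = np.load(f"/workspace/dataset/spotify/embs/audio/laion_clap/{track_id[0]}/{track_id[1]}/{track_id}.npy")
assert emb.shape[-1] == 512  # CLAP audio embeddings are 512-D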

  • SigLIP2 image embeddings

python -m talkenv.extractor.image.siglip2

This script:

  • Loads the google/siglip2-base-patch16-224 model and its processor.

  • Iterates over ImageDataset, converting each cover image into a 768‑D embedding.

  • Saves .npy files under: /workspace/dataset/spotify/embs/image/siglip2/{track_id[0]}/{track_id[1]}/{track_id}.npy.

  • Qwen3 text embeddings

python -m talkenv.extractor.text.nv_embed_v2 \
  --dataset metadata \
  --emb_path /workspace/dataset/spotify/embs/text \
  --emb_name qwen3_small \
  --device cuda

This script:

  • Loads Qwen3-Embedding-0.6B and its tokenizer.
  • Iterates over MetadataDataset, AttributesDataset or LyricsDataset (depending on --dataset).
  • For each non-empty text, runs the model and saves a single pooled embedding vector per track.
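
For reference, a pooled embedding can be obtained along these lines; the exact pooling strategy used by the script is an assumption (last-token pooling with left padding is the usual recipe for Qwen embedding models):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", padding_side="left")
model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B").eval().to("cuda")

texts = ["music lyrics: ..."]  # prompts as produced by the dataset classes
batch = tok(texts, padding=True, truncation=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq, dim)

# With left padding, the last position holds the final real token of every sequence.
emb = F.normalize(hidden[:, -1], dim=-1)  # one pooled vector per text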

3. Train an RVQ codebook

Once you have continuous embeddings, you can train a residual vector quantizer:

python -m talkenv.quantizer.rvq_train \
  --modality audio \
  --num_quantizers 4 \
  --codebook_size 64 \
  --batch_size 65536 \
  --epochs 20 \
  --save_dir /workspace/talkplay/exp/rvq_audio

This will:

  • Load the specified modality’s embeddings via EmbeddingDataset.
  • Train a ResidualVQ model from vector-quantize-pytorch.
  • Periodically evaluate codebook utilization and commit loss on a test split.
  • Save:
    • Model config: config.yaml
    • Best checkpoint: best.pt
    • Final metrics: final_metrics.json
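
For orientation, the core of an RVQ step with vector-quantize-pytorch looks roughly like this; hyperparameters mirror the command above, and the EMA-based codebook update is the library default (the actual script adds data loading, evaluation, and checkpointing):

import torch
from vector_quantize_pytorch import ResidualVQ

rvq = ResidualVQ(
    dim=512,           # embedding dimensionality (e.g. CLAP audio)
    num_quantizers=4,  # --num_quantizers
    codebook_size=64,  # --codebook_size
)
rvq.train()

x = torch.randn(8192, 512)  # one batch of precomputed embeddings (toy data)

# By default the codebooks are updated by exponential moving average inside
# forward(); the returned commit loss is useful for monitoring convergence.
quantized, indices, commit_loss = rvq(x.unsqueeze(1))  # indices: (8192, 1, 4)
print(commit_loss.sum().item())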

4. RVQ inference (codes)

After training, you can convert embeddings into discrete code indices:

python -m talkenv.quantizer.rvq_infer \
  --modality audio \
  --num_quantizers 4 \
  --codebook_size 64 \
  --save_dir /workspace/talkplay/exp/rvq_audio \
  --ckpt /workspace/talkplay/exp/rvq_audio/best.pt

This script:

  • Loads the trained ResidualVQ checkpoint.
  • Iterates over the specified dataset splits and modalities.
  • Writes discrete code indices for each track into a configurable subdirectory (e.g. codes/).
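
Saved codes can later be mapped back to approximate continuous vectors. A sketch using vector-quantize-pytorch's index-to-vector helper (checkpoint layout and the per-track code path are assumptions):

import numpy as np
import torch
from vector_quantize_pytorch import ResidualVQ

rvq = ResidualVQ(dim=512, num_quantizers=4, codebook_size=64)
state = torch.load("/workspace/talkplay/exp/rvq_audio/best.pt", map_location="cpu")
rvq.load_state_dict(state)  # assumes the checkpoint is a plain state_dict

# Hypothetical per-track code file written by rvq_infer.
codes = torch.from_numpy(np.load("/workspace/talkplay/exp/rvq_audio/codes/7/a/7abc123.npy"))

# Sum the selected codebook vectors across the 4 quantizers.
recon = rvq.get_output_from_indices(codes.view(1, 1, -1))  # (1, 1, 512)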
