A Multimodal Dataset for Three-Party Conversations with Speech, Motion, and Gaze
- Installation
- Tools
- Data Capture
- Data Pre-processing
- Data Processing
- Analysis
- Visualization
- Authors
- License and copyright
- Acknowledgements
- Python version >= 3.8.0
pip install -r requirements.txt
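To confirm the Python requirement before installing the dependencies, a quick check such as the following can be used (this snippet is illustrative and not part of the repository):

```python
import sys

# The repository requires Python 3.8 or newer.
if sys.version_info < (3, 8):
    raise RuntimeError(f"Python >= 3.8 required, found {sys.version.split()[0]}")
print("Python version OK")
```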
- ViconIQ — Motion capture and motion data processing
- D-Lab — Gaze tracking and audio capture
- Audacity — Audio trimming and channel-level processing
- Praat — Prosodic feature extraction (pitch and intensity)
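Praat is used interactively for the prosodic features. For reference only, the sketch below shows an equivalent pitch and intensity extraction in Python via the parselmouth bindings to Praat; parselmouth is not a stated dependency of this repository, and the file path is a placeholder.

```python
import parselmouth  # Python bindings to Praat (installable via `pip install praat-parselmouth`)

# Load one participant's audio track (placeholder path).
snd = parselmouth.Sound("Data/participant1.wav")

# Pitch (fundamental frequency) and intensity contours as Praat computes them.
pitch = snd.to_pitch()
intensity = snd.to_intensity()

f0 = pitch.selected_array["frequency"]   # Hz, 0 where unvoiced
f0_times = pitch.xs()                    # frame timestamps in seconds
db = intensity.values[0]                 # intensity contour in dB
db_times = intensity.xs()

print(f"{len(f0)} pitch frames, {len(db)} intensity frames")
```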
All modalities are synchronized using a physical clapboard instrumented with motion-capture markers.
The clap event provides a shared temporal reference across motion, gaze, and audio streams.
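After export, the clap-based alignment amounts to shifting every stream so that its recorded clap time maps to a common zero. The sketch below is a minimal illustration, assuming the clap times have already been located in each modality; all variable names and values are hypothetical.

```python
import numpy as np

# Clap event time (seconds) as observed on each modality's own clock (hypothetical values).
clap_times = {"motion": 12.43, "gaze": 8.91, "audio": 3.27}

def align_to_clap(timestamps, clap_time):
    """Shift a stream's timestamps so the clap event occurs at t = 0."""
    return np.asarray(timestamps) - clap_time

# Example: align motion frame times recorded at 120 fps.
motion_times = np.arange(0, 60, 1 / 120.0)
motion_aligned = align_to_clap(motion_times, clap_times["motion"])
```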
Motion data are processed using ViconIQ, including:
- Gap interpolation
- Temporal smoothing
A step-by-step demonstration is available in this video tutorial.
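ViconIQ performs the gap interpolation and smoothing itself; purely as a reference, a rough Python equivalent applied to exported marker trajectories might look like the sketch below (pandas and SciPy assumed; the file name and column names are hypothetical).

```python
import pandas as pd
from scipy.signal import savgol_filter

# Exported marker trajectory with occasional gaps (NaN rows); column names are hypothetical.
markers = pd.read_csv("Data/motion_markers.csv")  # columns: frame, x, y, z

# Gap interpolation: fill short dropouts along the time axis.
markers[["x", "y", "z"]] = markers[["x", "y", "z"]].interpolate(
    method="linear", limit_direction="both"
)

# Temporal smoothing: Savitzky-Golay filter over each coordinate.
for col in ["x", "y", "z"]:
    markers[col] = savgol_filter(markers[col], window_length=11, polyorder=3)
```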
Gaze and audio data are exported from D-Lab.
See this video tutorial (release coming soon) for the export workflow.
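D-Lab exports are tabular; a minimal sketch of loading an exported gaze file in Python is shown below. The file name and schema are placeholders, not the actual D-Lab export format.

```python
import pandas as pd

# Load a D-Lab gaze export (placeholder file name; real exports may use a different delimiter/schema).
gaze = pd.read_csv("Data/gaze_export.csv")

# Inspect the available columns and the number of samples.
print(gaze.columns.tolist())
print(f"rows: {len(gaze)}")
```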
Audio files are processed using Audacity to:
- Trim recordings
- Mute other participants’ voices in each individual audio track
A demonstration is available in this video tutorial.
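Audacity is used interactively for these steps. As a rough programmatic analogue (not the workflow used here), the sketch below trims a recording and silences a span using the soundfile library; file names and time spans are hypothetical.

```python
import soundfile as sf

# Load one participant's raw track (placeholder file name).
audio, sr = sf.read("Data/participant1_raw.wav")

# Trim: keep only the portion between the start and end times (seconds, hypothetical).
start_s, end_s = 5.0, 605.0
audio = audio[int(start_s * sr):int(end_s * sr)]

# Mute: zero out a span where another participant's voice bleeds into this track (hypothetical span).
mute_start, mute_end = 42.0, 47.5
audio[int(mute_start * sr):int(mute_end * sr)] = 0.0

sf.write("Data/participant1_clean.wav", audio, sr)
```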
For details on motion processing, please refer to this document.
For details on gaze processing, please refer to this document.
For details on audio processing, please refer to this document.
- Download the dataset and place it inside the Data folder
- Run the example Jupyter Notebook
git clone https://github.com/MCMartinLee/Conversation_Demo
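Once the dataset is placed in the Data folder, the example notebook can load it. A minimal loading sketch, with hypothetical file names, looks like this:

```python
import pandas as pd

# File names are hypothetical; see the repository documents for the actual dataset layout.
motion = pd.read_csv("Data/session01_motion.csv")
gaze = pd.read_csv("Data/session01_gaze.csv")

print(motion.shape, gaze.shape)
```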
Meng-Chen Lee, mlee45 (at) uh.edu
Zhigang Deng
The scripts are licensed under the MIT license.
The software in the related C++ module repository is also covered by the MIT license provided in that repository.
This work was supported in part by NSF IIS-2005430. We would like to thank Mai Trinh for helping with data capture in this work. We also want to thank the volunteers who participated in the data collection experiments.
If you use this work, please cite the data paper available here.
