Fanfan Wang, Xiangqing Shen, Jianfei Yu*, and Rui Xia*
Emotional Support Conversation (ESC) systems aim to alleviate user distress. However, current Chain-of-Thought based ESC methods often employ rigid, text-only reasoning, which limits adaptability in dynamic, multimodal interactions and introduces reasoning noise that degrades support quality. To address this, we introduce "Flexible Thinking" for multimodal ESC, enabling models to adaptively select contextually relevant thinking aspects: Visual Scene, Emotion, Situation, and Response Strategy. We first construct training data by manually curating flexible thinking demonstrations on the MESC dataset and then using a Multimodal Large Language Model to synthesize these processes for the full training set. We then propose FIRES, a framework that integrates Supervised Fine-Tuning (SFT) for initial learning with Reinforcement Learning for refinement. This two-stage approach helps FIRES transcend SFT's generalization limits and, crucially, directly links thinking processes to response quality via tailored rewards, moving beyond imitating potentially imperfect synthetic data. Experiments on the MESC and EMOTyDA datasets demonstrate FIRES's effectiveness and generalizability in fostering higher-quality emotional support responses through adaptive reasoning.
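As a rough illustration of how a tailored reward can link the thinking process to response quality, the sketch below combines a simple format check over the selected thinking aspects with a BERTScore-based response reward. The `</think>` delimiter, helper name `compute_reward`, and 0.5/0.5 weighting are illustrative assumptions, not the exact reward design used in FIRES.

```python
# Illustrative sketch only: the tag format, reward terms, and weights below are
# assumptions for exposition, not the exact rewards used in FIRES.
from bert_score import score as bert_score

THINKING_ASPECTS = ("Visual Scene", "Emotion", "Situation", "Response Strategy")

def compute_reward(generated: str, reference: str) -> float:
    """Combine a thinking-format reward with a response-quality reward."""
    # Assume the model separates its flexible thinking from the reply with a
    # (hypothetical) </think> delimiter.
    thinking, _, response = generated.partition("</think>")
    # Format reward: at least one contextually selected aspect should appear.
    selected = [a for a in THINKING_ASPECTS if a in thinking]
    format_reward = 1.0 if selected else 0.0
    # Quality reward: semantic similarity to the reference response (BERTScore F1).
    _, _, f1 = bert_score([response.strip()], [reference], lang="en")
    return 0.5 * format_reward + 0.5 * f1.item()
```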
Based on the timestamps provided by the MESC dataset, we extract video clips from raw episodes of In Treatment and sample keyframes via FFmpeg: `ffmpeg -i {video_file} -vf select='eq(pict_type\,I)' -vsync vfr -f image2 frame_%d.png`. We concatenate consecutive utterances from the same speaker to consolidate their conversational turns, and designate each therapist turn as the target response of an instance, with the preceding utterances serving as the conversation history.
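For reference, the following is a minimal preprocessing sketch. The helper names (`clip_video`, `extract_keyframes`, `merge_turns`) and the timestamp fields are illustrative assumptions, not the released pipeline; only the keyframe filter mirrors the FFmpeg command above.

```python
# Illustrative preprocessing sketch; function names and timestamp fields
# (start/end in seconds) are assumptions, not the released code.
import subprocess
from pathlib import Path

def clip_video(video_file: str, start: float, end: float, out_clip: str) -> None:
    """Cut the [start, end] segment of an episode into a standalone clip
    (stream-copy for speed; cut points snap to keyframes)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_file, "-ss", str(start), "-to", str(end),
         "-c", "copy", out_clip],
        check=True,
    )

def extract_keyframes(clip: str, out_dir: str) -> None:
    """Dump the clip's I-frames (keyframes) as PNGs, mirroring the command above."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", clip, "-vf", r"select='eq(pict_type\,I)'",
         "-vsync", "vfr", "-f", "image2", f"{out_dir}/frame_%d.png"],
        check=True,
    )

def merge_turns(utterances: list[dict]) -> list[dict]:
    """Concatenate consecutive utterances from the same speaker into one turn."""
    turns: list[dict] = []
    for utt in utterances:
        if turns and turns[-1]["speaker"] == utt["speaker"]:
            turns[-1]["text"] += " " + utt["text"]
        else:
            turns.append({"speaker": utt["speaker"], "text": utt["text"]})
    return turns
```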
The processed data files for training and inference are located in the `data/` folder.
- Backbone: Qwen2.5-VL-7B-Instruct
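For convenience, below is a minimal sketch of loading the backbone with Hugging Face Transformers (requires a recent transformers release with Qwen2.5-VL support); it is illustrative and independent of the training scripts in this repo.

```python
# Minimal, illustrative loading of the Qwen2.5-VL-7B-Instruct backbone.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
```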
If you find this repo useful in your research, please consider citing:
@inproceedings{wang2025flexible,
  title={Flexible Thinking for Multimodal Emotional Support Conversation via Reinforcement Learning},
  author={Wang, Fanfan and Shen, Xiangqing and Yu, Jianfei and Xia, Rui},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
  pages={1341--1356},
  year={2025}
}
Our implementation benefits from ms-swift, LlamaFactory, ESC, bert_score and EmpGPT-3. We appreciate their valuable contributions.