Flexible Thinking for Multimodal Emotional Support Conversation via Reinforcement Learning

Fanfan Wang, Xiangqing Shen, Jianfei Yu*, and Rui Xia*


Emotional Support Conversation (ESC) systems aim to alleviate user distress. However, current Chain-of-Thought based ESC methods often employ rigid, text-only reasoning, limiting adaptability in dynamic, multimodal interactions and introducing reasoning noise that degrades support quality. To address this, we introduce "Flexible Thinking" for multimodal ESC, enabling models to adaptively select contextually relevant thinking aspects: Visual Scene, Emotion, Situation, and Response Strategy. We first construct training data by manually curating flexible thinking demonstrations on the MESC dataset, then using a Multimodal Large Language Model to synthesize these processes for the full training set. Then, we propose FIRES, a framework integrating Supervised Fine-Tuning (SFT) for initial learning with Reinforcement Learning for refinement. This two-stage approach helps FIRES transcend SFT’s generalization limits and, crucially, directly links thinking processes to response quality via tailored rewards, moving beyond imitating potentially imperfect synthetic data. Experiments on MESC and EMOTyDA datasets demonstrate FIRES’s effectiveness and generalizability in fostering higher-quality emotional support responses through adaptive reasoning.
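For intuition, here is a minimal sketch of how a flexible-thinking instance might be represented, with the four candidate aspects named above. The field names and example values are illustrative assumptions only, not the repository's actual data schema.

```python
# Illustrative only: a possible representation of a flexible-thinking instance.
# The aspect names come from the paper; all field names and example values
# are hypothetical, not the actual FIRES data format.
from dataclasses import dataclass, field

ASPECTS = ["Visual Scene", "Emotion", "Situation", "Response Strategy"]

@dataclass
class FlexibleThinkingInstance:
    history: list[str]                       # preceding utterances (conversation history)
    selected_aspects: list[str]              # contextually relevant subset of ASPECTS
    thinking: dict[str, str] = field(default_factory=dict)  # reasoning text per selected aspect
    response: str = ""                       # target supportive response

example = FlexibleThinkingInstance(
    history=["Client: I haven't been sleeping since the accident."],
    selected_aspects=["Emotion", "Response Strategy"],
    thinking={
        "Emotion": "The client sounds anxious and exhausted.",
        "Response Strategy": "Acknowledge the distress before probing further.",
    },
    response="That sounds exhausting. Can you tell me more about what keeps you up at night?",
)
```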

πŸ› οΈ Installation

📂 Data Preparation

Based on the timestamps provided by the MESC dataset, we extract video clips from raw episodes of In Treatment and sample their keyframes (I-frames) via FFmpeg: `ffmpeg -i {video_file} -vf select='eq(pict_type\,I)' -vsync vfr -f image2 frame_%d.png`. We concatenate consecutive utterances from the same speaker to consolidate their conversational turns, and designate each therapist's turn as the target response for an instance, with the preceding utterances serving as the conversation history.
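A minimal sketch of the turn-consolidation step described above, assuming each utterance is a (speaker, text) pair; the function and field names are illustrative, not the repository's actual preprocessing code.

```python
# Sketch of the preprocessing described above (illustrative, not the repo's script):
# merge consecutive utterances from the same speaker into single turns, then make
# each therapist turn a target response with the preceding turns as history.

def consolidate_turns(utterances):
    """utterances: list of (speaker, text) pairs in dialogue order."""
    turns = []
    for speaker, text in utterances:
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous turn: concatenate the utterances.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns

def build_instances(utterances, therapist="Therapist"):
    turns = consolidate_turns(utterances)
    instances = []
    for i, (speaker, text) in enumerate(turns):
        if speaker == therapist and i > 0:
            instances.append({"history": turns[:i], "response": text})
    return instances

if __name__ == "__main__":
    dialogue = [
        ("Client", "I had another rough week."),
        ("Client", "I barely slept."),
        ("Therapist", "That sounds hard. What kept you up?"),
    ]
    print(build_instances(dialogue))
```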

The processed data files for training and inference are located in the `data/` folder.

πŸ‹οΈ Training & Evaluation

πŸ“ Citation

If you find this repo useful in your research, please consider citing:

@inproceedings{wang2025flexible,
  title={Flexible Thinking for Multimodal Emotional Support Conversation via Reinforcement Learning},
  author={Wang, Fanfan and Shen, Xiangqing and Yu, Jianfei and Xia, Rui},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
  pages={1341--1356},
  year={2025}
}

πŸ™ Acknowledgement

Our implementation benefits from ms-swift, LlamaFactory, ESC, bert_score and EmpGPT-3. We appreciate their valuable contributions.
