PsyDTCorpus Psychological Counselor Digital Twin Dataset
PsyDTCorpus is a digital twin dataset for psychological counselors, released in 2024 by the School of Future Technology of South China University of Technology and the Guangdong Provincial Key Laboratory of Digital Twins. The core goal of the dataset is to simulate the language style and counseling techniques of specific psychological counselors, in order to support the development and training of the psychological counselor digital twin model SoulChat2.0. A related paper is "SoulChat: Improving LLMs' Empathy, Listening, and Comfort Abilities through Fine-tuning with Multi-turn Empathy Conversations".
The PsyDTCorpus dataset is modeled on real multi-round consultation cases of specific psychological counselors. Digital twin data synthesis is performed on 5k single-round consultation samples, yielding 5k high-quality mental health dialogues that reflect the counselor's language style and application of therapeutic techniques. Of these, 4,760 samples are used as the training set, and 240 samples are split into multiple test samples. The dataset contains 90,365 rounds in total, of which 4,311 are in the test set.
This dataset uses an innovative data generation framework that combines the language style and counseling techniques of real counselors with the Big Five personality traits of clients to synthesize data from single-round conversations. Through this framework, the research team generated multi-round conversation data that effectively characterizes the language style and counseling techniques of specific counselors. In this project, the generated multi-round conversations contain 90,365 rounds in total, an average of about 18 rounds per conversation sample.
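The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the published numbers; the variable names are illustrative, and the training-set round count is derived here by subtraction rather than stated in the dataset description.

```python
# Published figures from the dataset description.
TRAIN_SAMPLES = 4760
TEST_SAMPLES = 240
TOTAL_ROUNDS = 90365
TEST_ROUNDS = 4311

# Derived values (assumption: test rounds are a subset of the total,
# as the description's "of which" wording suggests).
total_samples = TRAIN_SAMPLES + TEST_SAMPLES
train_rounds = TOTAL_ROUNDS - TEST_ROUNDS
avg_rounds_per_sample = TOTAL_ROUNDS / total_samples

print(f"total samples: {total_samples}")
print(f"training rounds (derived): {train_rounds}")
print(f"average rounds per sample: {avg_rounds_per_sample:.2f}")
```

The average comes out slightly above 18, which is consistent with the rounded figure given in the description.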
PsyDTCorpus was manually evaluated and compared with other datasets on four professional dimensions: conversation techniques, state and attitude, relationship building, and therapy techniques. The results showed significant improvements on these dimensions compared with other datasets, demonstrating the feasibility of using a small number of consultation cases from real psychological counselors to construct high-quality multi-round mental health conversation data.
