2 months ago

MotionGPT: Human Motion as a Foreign Language

Jiang, Biao ; Chen, Xin ; Liu, Wen ; Yu, Jingyi ; Yu, Gang ; Chen, Tao

Abstract

Although pre-trained large language models continue to advance, building a unified model for language and other multimodal data, such as motion, remains challenging and largely unexplored. Fortunately, human motion displays a semantic coupling akin to human language and is often perceived as a form of body language. Fusing language data with large-scale motion models therefore makes possible a motion-language pre-training that can improve performance on motion-related tasks. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model that handles multiple motion-relevant tasks. Specifically, we apply discrete vector quantization to human motion and convert 3D motion into motion tokens, much as word tokens are generated from text. Building on this "motion vocabulary," we perform language modeling on motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT on a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks, including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
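To make the "motion vocabulary" idea concrete, here is a minimal sketch of the quantization step described in the abstract: each per-frame motion feature is mapped to the index of its nearest codebook entry, and the indices are rendered as tokens that can sit alongside words in a prompt. This is an illustration under stated assumptions, not the authors' implementation. The random codebook, the feature dimensions, the function names `quantize_motion` and `to_motion_tokens`, and the `<motion_k>` token format are all hypothetical. In the paper's setting, the codebook and the encoder that produces the features would be learned rather than sampled.

```python
import numpy as np


def quantize_motion(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each frame feature, the index of the nearest codebook vector.

    features: array of shape (T, D), one D-dimensional feature per frame.
    codebook: array of shape (K, D), K code vectors (assumed given here).
    """
    # Squared Euclidean distance between every frame and every code: shape (T, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def to_motion_tokens(indices: np.ndarray) -> list[str]:
    """Render code indices as text-like tokens (hypothetical token format)."""
    return [f"<motion_{int(i)}>" for i in indices]


# Toy data: a random codebook and features that lie close to known codes.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
true_ids = np.array([3, 3, 5, 0])
features = codebook[true_ids] + 0.01 * rng.normal(size=(4, 4))

ids = quantize_motion(features, codebook)
tokens = to_motion_tokens(ids)

# Motion tokens and words can then share one sequence for language modeling.
prompt = "Describe the motion: " + " ".join(tokens)
print(prompt)
```

Once motion is expressed as discrete tokens, a single sequence model can in principle be trained on mixed text and motion sequences. That is the sense in which the abstract treats motion as a language.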