2 months ago

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Siyu Jiao; Hongguang Zhu; Jiannan Huang; Yao Zhao; Yunchao Wei; Humphrey Shi

Abstract

Pre-trained vision-language models such as CLIP are increasingly used to address the challenging open-vocabulary segmentation (OVS) task, thanks to their well-aligned vision-text embedding space. Typical solutions either freeze CLIP during training to unilaterally preserve its zero-shot capability, or fine-tune the CLIP vision encoder to gain perceptual sensitivity to local regions. However, few of these approaches incorporate vision-text collaborative optimization. Based on this, we propose Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image and offers a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy that reviews the original CLIP-V representation as compensation in order to maintain the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, improving the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimizing mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU on A-847, A-150, PC-459, PC-59 and PAS-20, respectively. Furthermore, in the panoptic setting on ADE20K, we achieve 27.1 PQ, 73.5 SQ and 32.9 RQ. Code will be available at https://github.com/jiaosiyu1999/MAFT-Plus.git.
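To make the two ideas in the abstract more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the image-text interaction or the compensation is realized. The sketch assumes a single-head cross-attention with a residual connection for the content-dependent text enhancement, and a cosine-distance penalty between fine-tuned and frozen visual features for the compensation. All function names, shapes and random inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def content_dependent_text(text_emb, img_feats):
    """Assumed form: each class embedding (query) attends over image
    features (keys/values) and is updated through a residual connection.
    text_emb: (C, D), img_feats: (N, D) -> (C, D), unit-normalized."""
    d = text_emb.shape[-1]
    attn = softmax(text_emb @ img_feats.T / np.sqrt(d), axis=-1)  # (C, N)
    return l2_normalize(text_emb + attn @ img_feats)

def compensation_penalty(tuned_feats, frozen_feats):
    """Assumed form: mean cosine distance that discourages fine-tuned
    visual features from drifting away from the frozen encoder's features."""
    cos = np.sum(l2_normalize(tuned_feats) * l2_normalize(frozen_feats), axis=-1)
    return float(np.mean(1.0 - cos))

# Hypothetical toy inputs: 3 class names, 16 image patches, dimension 8.
rng = np.random.default_rng(0)
text_emb = l2_normalize(rng.normal(size=(3, 8)))
frozen = rng.normal(size=(16, 8))                # stand-in for frozen CLIP-V features
tuned = frozen + 0.1 * rng.normal(size=(16, 8))  # stand-in for fine-tuned features

enhanced = content_dependent_text(text_emb, tuned)
logits = l2_normalize(tuned) @ enhanced.T        # (16, 3) patch-to-class similarity
penalty = compensation_penalty(tuned, frozen)
print(enhanced.shape, logits.shape)
```

In this toy setup the text embeddings change with every image, while the penalty term stays near zero as long as the fine-tuned features remain close to the frozen ones. That is one way to read "collaborative" optimization of both modalities.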