2 months ago

iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou; Chen Wei; Huiyu Wang; Wei Shen; Cihang Xie; Alan Yuille; Tao Kong

Abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), in which text is first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and point out the advantages and challenges of using a semantically meaningful visual tokenizer. We present iBOT, a self-supervised framework that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens, taking the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline in which the tokenizer must be pre-trained beforehand. We show the prominence of iBOT by achieving 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline the emergence of local semantic patterns, which helps the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks such as object detection, instance segmentation, and semantic segmentation.
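To make the objective described in the abstract more concrete, the following is a minimal sketch of a self-distillation loss over the class token and the masked patch tokens. It is an illustration under stated assumptions, not the authors' implementation: all function names, the temperature values, and the momentum value are hypothetical. It assumes the teacher sees the unmasked image while the student sees the masked one, and that the teacher is kept as an exponential moving average of the student. Neither detail is spelled out in the abstract.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax, computed without taking log(0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def cross_entropy(teacher_logits, student_logits, teacher_temp, student_temp):
    """H(teacher, student): the teacher output acts as the soft target."""
    targets = softmax(teacher_logits, teacher_temp)
    log_preds = log_softmax(student_logits, student_temp)
    return -sum(t * lp for t, lp in zip(targets, log_preds))


def distillation_loss(teacher_cls, student_cls, teacher_patches, student_patches,
                      mask, teacher_temp=0.04, student_temp=0.1):
    """Class-token loss plus patch-token loss averaged over masked positions.

    The temperatures are illustrative defaults (an assumption).
    mask[i] is True where patch i was masked in the student's input.
    """
    cls_loss = cross_entropy(teacher_cls, student_cls, teacher_temp, student_temp)
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    patch_total = sum(
        cross_entropy(teacher_patches[i], student_patches[i], teacher_temp, student_temp)
        for i in masked
    )
    patch_loss = patch_total / max(len(masked), 1)
    return cls_loss + patch_loss


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move teacher parameters toward the student (assumed update rule)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In this sketch only the student would receive gradients from `distillation_loss`. The teacher, acting as the online tokenizer, changes only through `ema_update`, which is one way a tokenizer can be learned jointly with the MIM objective without a separate pre-training stage.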