Command Palette
Search for a command to run...
FedCV: 多様なコンピュータビジョンタスクのための連合学習フレームワーク
FedCV: 多様なコンピュータビジョンタスクのための連合学習フレームワーク
エッジ検出 — 一般的なコンピュータビジョンおよび深層学習のアプローチ
概要
連合学習(FL)は、エッジデバイス上の分散データセットからグローバルモデルまたはパーソナライズドモデルを学習できる分散学習のパラダイムである。しかし、コンピュータビジョン分野では、統一されたFLフレームワークにおいて多様なタスクへの探求が不足しているため、FLにおけるモデルの性能は集中型トレーニングに大きく遅れをとっている。FLは、物体検出や画像セグメンテーションといった高度なコンピュータビジョンタスクにおいて、効果的に実証された例はほとんどない。このギャップを埋め、コンピュータビジョンタスクにおけるFLの開発を促進するため、本稿では、画像分類、画像セグメンテーション、物体検出という3つの代表的なコンピュータビジョンタスクにおいてFLを評価するための連合学習ライブラリおよびベンチマークフレームワーク「FedCV」を提案する。我々は、非独立同分布(non-I.I.D.)のベンチマークデータセット、モデル、および各種の参照用FLアルゴリズムを提供する。我々のベンチマーク研究は、将来の探求に値する複数の課題が存在することを示唆している。すなわち、集中型トレーニングのトリックはFLに直接適用できないこと、非I.I.D.データセットは実際には異なるタスクにおいてモデルの精度をある程度低下させること、そして膨大なパラメータ数と各クライアントのメモリコストを考慮すると、連合トレーニングのシステム効率を向上させることは困難であることである。
One-sentence Summary
The authors propose FedCV, a unified federated learning framework and benchmarking library that provides non-I.I.D. datasets, models, and reference algorithms to evaluate image classification, segmentation, and object detection, demonstrating that centralized training techniques do not directly apply to federated learning, that non-I.I.D. data reduces accuracy across tasks, and that large parameter counts and per-client memory limits hinder system efficiency.
Key Contributions
- This work introduces FedCV, a federated learning library and benchmarking framework that evaluates distributed training across image classification, image segmentation, and object detection. The system provides a standardized environment for testing federated algorithms on representative computer vision tasks.
- The framework provides non-I.I.D. benchmarking datasets, standardized model architectures, and multiple reference federated optimization algorithms to support systematic evaluation.
- A comprehensive benchmark study identifies key challenges in applying federated learning to computer vision. The analysis demonstrates that centralized training techniques often fail in federated settings, non-I.I.D. data distributions reduce model accuracy, and large parameter counts create significant per-client memory bottlenecks.
Introduction
Federated learning enables privacy-preserving model training across decentralized edge devices, offering a practical solution for computer vision applications constrained by strict data privacy regulations, high transfer costs, or proprietary data requirements. Despite its potential, prior federated learning research has largely overlooked advanced vision tasks, focusing instead on small-scale image classification with centralized optimization techniques that do not translate well to distributed edge environments. This narrow focus leaves critical challenges like non-I.I.D. data distribution, hardware heterogeneity, and the scalability of large vision models largely unaddressed. To bridge this gap, the authors introduce FedCV, a unified federated learning library and benchmarking framework tailored specifically for computer vision. The framework supports image classification, segmentation, and object detection within a single distributed training architecture, providing standardized non-I.I.D. datasets, modular federated optimization algorithms, and scalable APIs to accelerate fair evaluation and real-world deployment.
Dataset
- Dataset Composition & Sources: The authors curate a federated computer vision benchmark suite comprising four primary datasets: Google Landmarks Dataset 23k (GLD-23K) sourced from smartphone users, the standard CIFAR-100 dataset, an augmented PASCAL VOC 2011 subset, and the COCO dataset.
- Subset Details: GLD-23K and CIFAR-100 are designated for image classification. The PASCAL VOC subset contains 11,355 images annotated across 20 foreground object classes and one background class. COCO provides realistic everyday scenes with detection and segmentation labels generated via Amazon Mechanical Turk.
- Data Usage & Partitioning: The team applies Latent Dirichlet Allocation (LDA) to partition all datasets into non-I.I.D. client distributions, simulating realistic federated learning environments. They use the data to train and evaluate federated optimization algorithms like FedAvg and FedOpt, while also running centralized fine-tuning baselines for comparison.
- Processing & Pipeline: Rather than implementing custom cropping or metadata extraction, the authors standardize preprocessing by providing automated download scripts and integrated data loaders. These pipelines handle annotation parsing and client-side distribution, ensuring consistent data formatting and streamlining setup across all federated clients.
Method
The authors leverage the FedML.ai framework as a foundation to design FedCV, an open-source federated learning system tailored for diverse computer vision tasks. The system architecture, illustrated in the framework diagram, integrates a layered structure to support both high-level research and low-level distributed communication. At the top, FedML-API provides high-level abstractions for users to define computer vision applications, including image classification, segmentation, and object detection, alongside non-IID datasets and data loaders. These applications are supported by a suite of federated learning algorithms, such as FedAvg, FedOpt, FedNova, FedProx, and others, which are implemented using distributed or standalone configurations. The core of the system is built upon FedML-core, which handles low-level functionalities like communication, model training, and security primitives. This layer includes components such as the ComManager for managing communication threads, and the Abstract Communication Layer that supports various protocols including MPI, MQTT, and tensor-aware RPC, enabling efficient inter-node communication across different data centers. The framework also incorporates security and privacy primitives, such as secure aggregation, to enhance the robustness of federated learning processes. 
The design of FedCV emphasizes flexibility and ease of use through modular components and extensible APIs. The system supports multiple computer vision applications, including healthcare, smart home surveillance, and autonomous driving, which are integrated into the framework through dedicated models and data handling mechanisms. The training engine within FedCV leverages PyTorch and supports on-client learning frameworks, allowing for customizable training procedures and model architectures. The communication layer is optimized for distributed computing, with components like the Send Thread and Receive Thread enabling efficient data exchange between workers and the coordinator. This modular design facilitates the implementation of diverse federated learning algorithms, including decentralized, vertical, and split learning paradigms, by abstracting the underlying communication and optimization processes. 
In the context of computer vision tasks, the authors introduce a specific model architecture for image segmentation, which is detailed in the figure below. This architecture employs a deep convolutional neural network (DCNN) as an encoder, processing input images through a series of convolutional layers with varying rates, such as 6, 12, and 18, to capture multi-scale features. The encoder output is then passed to a decoder, which reconstructs the segmentation map through upsampling and concatenation operations. The decoder uses 1x1 convolutions and 3x3 convolutions to refine the features, ultimately producing the final prediction. This design enables efficient feature extraction and reconstruction, suitable for various segmentation tasks. 
The overall framework is designed to address the research gap in federated learning for computer vision by providing benchmarks, reference implementations, and flexible APIs. This enables researchers to explore various training paradigms and network typologies efficiently. The system supports multi-GPU distributed training, which reduces computational time and makes large-scale experiments feasible. By integrating these components, FedCV offers a comprehensive platform for developing and evaluating federated learning algorithms in diverse computer vision applications. 
Experiment
Experiments across image classification, segmentation, and object detection tasks on various deep learning models and federated datasets evaluate the impact of data heterogeneity, training strategies, and system constraints. Classification results validate that fine-tuning significantly accelerates convergence and improves accuracy, while segmentation experiments demonstrate that training from scratch often matches fine-tuned performance and better suits memory-constrained edge devices. Object detection evaluations reveal that federated frameworks struggle to match centralized results on non-IID data due to optimizer incompatibilities and communication bottlenecks. Collectively, these findings underscore the critical trade-offs between model complexity, data distribution, and system efficiency in practical federated learning deployments.
The authors present experimental results on image classification tasks using different models and datasets, comparing centralized training with federated learning approaches. The results show that centralized training generally outperforms federated learning, with performance varying based on model type, data partitioning, and training strategies. The experiments highlight the impact of non-IID data distributions and the effectiveness of fine-tuning versus training from scratch. Centralized training achieves higher accuracy than federated learning for both EfficientNet and MobileNet V3 on CIFAR-100 and GLD-23k datasets. Fine-tuning provides better performance compared to training from scratch, with the gap being more pronounced under non-IID data conditions. The choice of learning rate and data partitioning significantly influences model accuracy in federated learning settings.
The authors present experimental results on object detection tasks using YOLOv5 models in a federated learning setting. The results show that model performance varies with structural complexity, where deeper and wider models achieve higher accuracy but require more training time. The performance gap between centralized and federated training is significant, indicating challenges in transferring centralized training techniques to federated environments. Deeper and wider YOLOv5 models achieve higher accuracy but require longer training times. Federated training performance is notably lower than centralized training for object detection tasks. The choice of model structure significantly impacts both accuracy and training efficiency in federated settings.
The authors present experimental results on image classification, image segmentation, and object detection tasks in federated learning settings. The results show that centralized training generally outperforms federated training, particularly for non-IID data distributions, and that fine-tuning can significantly improve performance compared to training from scratch. The choice of optimizer, learning rate, and model architecture also affects performance, with momentum SGD and learning rate schedulers showing mixed effects depending on the context. Centralized training achieves higher accuracy than federated training, especially under non-IID data conditions. Fine-tuning leads to faster convergence and better final accuracy compared to training from scratch. The choice of learning rate and optimizer impacts model performance, with momentum SGD and learning rate schedulers showing varying effectiveness across different settings.
The authors present experimental results on image classification tasks, comparing centralized training with federated learning methods across different models and non-IID data settings. The results show that centralized training generally outperforms federated learning, with performance differences varying based on the degree of data heterogeneity and the use of fine-tuning. The impact of optimization techniques such as momentum SGD and learning rate schedulers is also analyzed, revealing that their effectiveness depends on the data distribution and model architecture. Centralized training achieves higher accuracy than federated learning, with the gap being more pronounced under higher data heterogeneity. Fine-tuning improves performance compared to training from scratch, especially in scenarios with less data heterogeneity. The effectiveness of momentum SGD and learning rate schedulers varies depending on the data distribution and model type.
The authors present experimental results on image classification, segmentation, and detection tasks, focusing on the efficiency of training different deep learning models in a federated setting. The the the table summarizes system performance metrics, showing that larger models like ViT require significantly more communication rounds and time compared to smaller models, indicating higher communication overhead in federated learning. Larger models such as ViT require substantially more communication rounds and total training time compared to smaller models like MobileNet-V3 and EfficientNet. The communication cost is significantly higher for larger models, with ViT having a much greater communication cost than the other models. The total training time increases with model size, with ViT taking considerably longer to train than MobileNet-V3 and EfficientNet.
The experiments evaluate centralized versus federated learning across image classification, segmentation, and object detection tasks using diverse model architectures and datasets under non-IID data distributions. Results consistently demonstrate that centralized training achieves superior accuracy, with the performance gap widening as data heterogeneity increases. Fine-tuning proves more effective than training from scratch, particularly in federated environments, while larger model architectures incur substantially higher communication overhead and training latency. Ultimately, the findings highlight that federated learning performance is highly sensitive to data distribution, model complexity, and optimization strategies, underscoring the inherent challenges of replicating centralized training efficiency in distributed settings.