OpenTME: An Open Dataset of AI-Powered H&E Tumor Microenvironment Profiles from TCGA

Abstract

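The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (H&E)-stained histopathology remains limited. In this work, we release OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). The OpenTME outputs were generated with Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models. The application performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available on Hugging Face for non-commercial academic research. OpenTME will continue to be expanded, and we anticipate that it will serve as a useful resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.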

One-sentence Summary

To address the scarcity of large-scale, consistent, and quantitative tumor microenvironment characterization from routine H&E-stained histopathology, the paper introduces OpenTME, an open-access dataset of pre-computed TME profiles from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung) from The Cancer Genome Atlas (TCGA); the profiles were generated with Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models that performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, and they provide over 4,500 quantitative readouts per slide at cell-level resolution, available on Hugging Face for non-commercial academic research to support biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.

Key Contributions

  • OpenTME is an open-access dataset of pre-computed tumor microenvironment profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA), providing over 4,500 quantitative readouts per slide at cell-level resolution.
  • The profiles were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models that performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis.
  • The dataset is available on Hugging Face for non-commercial academic research and is anticipated to serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.

Introduction

The tumor microenvironment (TME) captured in routine H&E-stained histology slides holds critical prognostic and predictive information, yet systematic extraction of TME features from these images has been hindered by manual annotation bottlenecks and limited reproducibility. Although The Cancer Genome Atlas (TCGA) offers a wealth of digitized H&E slides, publicly available, AI-derived TME profiles at scale have been missing, forcing researchers to rerun costly inference pipelines or rely on small-scale annotations. The authors address this gap with OpenTME, an open dataset of AI-powered TME profiles computed from TCGA H&E whole-slide images, giving the community ready-to-use, standardized features that accelerate computational pathology research and enable robust benchmarking.

Dataset

The authors introduce OpenTME, a ready-to-use dataset of quantitative tumor microenvironment profiles derived from routine H&E-stained whole-slide images. It compiles pre-computed outputs from the AI-powered Atlas H&E-TME pipeline, which applies tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis to each slide.

  • Data source: Diagnostic formalin-fixed paraffin-embedded (FFPE) slides from The Cancer Genome Atlas (TCGA), accessed via the NCI Genomic Data Commons.
  • Cancer types and projects: Five indications across eight TCGA projects: bladder, breast, colorectal, liver, and lung cancer.
  • Final dataset size: 3,634 slides, after excluding 52 slides from an initial set of 3,686. Exclusions: 49 due to missing resolution metadata or file corruption, 2 non‑H&E stains (1 IHC, 1 Masson’s trichrome), and 1 slide that failed quality control because of fully out‑of‑focus tissue.
  • What each slide includes:
    • Over 4,500 quantitative readouts in CSV format, grouped into slide‑level tables per cancer type. The features span:
      • Tissue QC metrics (area and relative coverage per QC region)
      • Tissue segmentation metrics (area, count, roundness, eccentricity, etc., for seven tissue types)
      • Cell metrics (count, percentage, density, nuclear morphology per nine cell types, both slide‑level and stratified by tissue compartment)
      • Neighborhood metrics (spatial co‑occurrence statistics, ratios, densities within 20 µm and 40 µm radii)
    • Thumbnail images with overlay visualizations of tissue QC, tissue segmentation, and cell classification predictions.
  • How the dataset is used: The paper provides OpenTME as a resource for downstream research without requiring users to run AI inference. TME Studio, a collection of interactive marimo notebooks, accompanies the dataset with tutorials, immune infiltrate classification examples, Kaplan–Meier survival analysis, and visualizations. The authors intend it for biomarker discovery, spatial biology studies, and development of new computational pathology methods. No training split or mixture ratios apply, as the dataset contains aggregated slide‑level features, not raw images.
  • Additional processing notes: All features are generated by the Atlas H&E-TME application, which runs tissue quality control, segmentation into seven tissue types, cell detection and classification into nine cell types, and spatial neighborhood analysis. No patch‑level cropping is used; features are aggregated at the slide level. Researchers who need spatially resolved outputs (cell coordinates, polygon geometries) can apply through the Atlas H&E-TME Research Access Program.
  • Access and restrictions: The dataset is available on Hugging Face under a gated access model for non‑commercial academic research. Training models to replicate Atlas H&E-TME capabilities is prohibited, and users must comply with TCGA data use policies.
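Because the readouts ship as slide-level CSV tables, a downstream analysis can start from a plain table read. The following is a minimal sketch of that idea, not the TME Studio implementation: it reads one numeric feature and splits slides into "high" and "low" groups around the cohort median, a common starting point for immune infiltrate grouping. The column names (`slide_id`, `lymphocyte_density_per_mm2`), the sample values, and both helper functions are hypothetical and do not reflect the actual OpenTME schema.

```python
import csv
import io
from statistics import median

# Hypothetical excerpt of a slide-level table. The column names and values
# are illustrative assumptions, not the actual OpenTME schema or data.
SAMPLE_CSV = """slide_id,lymphocyte_density_per_mm2
slide_a,120.0
slide_b,850.5
slide_c,40.2
slide_d,610.0
"""


def load_feature(text, column):
    """Read one numeric feature column into a {slide_id: value} mapping."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["slide_id"]: float(row[column]) for row in reader}


def split_by_median(values):
    """Label each slide 'high' or 'low' relative to the cohort median."""
    cutoff = median(values.values())
    return {sid: ("high" if v > cutoff else "low") for sid, v in values.items()}


densities = load_feature(SAMPLE_CSV, "lymphocyte_density_per_mm2")
groups = split_by_median(densities)
print(groups)
```

A median split is only one possible cutoff; any grouping used for survival analysis would need to be justified for the cohort at hand.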

Method

The authors use a multi-stage computational pipeline within the Atlas H&E-TME application to process whole-slide images (WSIs) and extract detailed tumor microenvironment features. This pipeline consists of three sequential deep learning models: Tissue Quality Control, Tissue Segmentation, and Cell Classification.

As shown in the figure below:

The process begins with the Tissue QC model, which evaluates the input WSI to identify valid tissue regions, filter out artifacts, and exclude out-of-focus areas or markers. This step ensures that downstream analyses are performed only on high-quality tissue data. Following the quality control step, the Tissue Segmentation model partitions the valid tissue into distinct histological compartments. This model classifies regions into categories such as carcinoma, stroma, blood, epithelial tissue, and necrosis.
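Area and relative coverage readouts of the kind listed for the dataset can be derived from a per-pixel label map once the pixel resolution is known; without resolution metadata, a physical area cannot be computed. The sketch below illustrates that arithmetic on a toy grid. The function names, the grid, and the resolution value are assumptions made for illustration and are not taken from Atlas H&E-TME.

```python
from collections import Counter


def region_areas_mm2(label_grid, microns_per_pixel):
    """Return {label: area in mm^2} for a 2D grid of per-pixel labels."""
    # One pixel covers (microns_per_pixel / 1000)^2 square millimetres.
    pixel_area_mm2 = (microns_per_pixel / 1000.0) ** 2
    counts = Counter(label for row in label_grid for label in row)
    return {label: n * pixel_area_mm2 for label, n in counts.items()}


def relative_coverage(areas):
    """Return each label's share of the total labelled area."""
    total = sum(areas.values())
    return {label: area / total for label, area in areas.items()}


# Toy 2 x 4 label map at a hypothetical 500 µm per pixel.
grid = [
    ["carcinoma", "carcinoma", "stroma", "necrosis"],
    ["carcinoma", "stroma", "stroma", "stroma"],
]
areas = region_areas_mm2(grid, microns_per_pixel=500.0)
coverage = relative_coverage(areas)
print(areas)
print(coverage)
```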

Finally, the Cell Classification model operates on the segmented tissue to identify and categorize individual cells within the tumor microenvironment. This model distinguishes between various cell types, including carcinoma cells, endothelial cells, epithelial cells, fibroblasts, granulocytes, lymphocytes, macrophages, and plasma cells. By chaining these three models, the application generates comprehensive slide-level tissue and cell readouts, as well as neighborhood readouts, enabling a granular analysis of the tissue architecture.
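To make the idea of a neighborhood readout concrete, the sketch below counts, for each cell of one type, how many cells of another type lie within a fixed radius, then averages over the center cells. It is an illustrative brute-force version under stated assumptions: the cell tuples, coordinates, and the function name `mean_neighbors_within` are hypothetical, and the source does not describe how Atlas H&E-TME computes its co-occurrence statistics. The radii of 20 µm and 40 µm match those reported for the dataset.

```python
from math import hypot


def mean_neighbors_within(cells, center_type, neighbor_type, radius_um):
    """Average count of neighbor_type cells within radius_um of each center_type cell.

    `cells` is a list of (x_um, y_um, cell_type) tuples in micrometres.
    """
    centers = [i for i, cell in enumerate(cells) if cell[2] == center_type]
    if not centers:
        return 0.0
    total = 0
    for i in centers:
        cx, cy, _ = cells[i]
        for j, (nx, ny, ntype) in enumerate(cells):
            # Skip the center cell itself, then apply the type and distance tests.
            if j != i and ntype == neighbor_type and hypot(nx - cx, ny - cy) <= radius_um:
                total += 1
    return total / len(centers)


# Toy cell list with hypothetical coordinates (micrometres).
cells = [
    (0.0, 0.0, "carcinoma"),
    (10.0, 0.0, "lymphocyte"),
    (0.0, 30.0, "lymphocyte"),
    (100.0, 100.0, "carcinoma"),
    (100.0, 115.0, "lymphocyte"),
    (200.0, 200.0, "lymphocyte"),
]
near_20 = mean_neighbors_within(cells, "carcinoma", "lymphocyte", 20.0)
near_40 = mean_neighbors_within(cells, "carcinoma", "lymphocyte", 40.0)
print(near_20, near_40)
```

The double loop is quadratic in the number of cells, which is fine for a toy example; a whole slide would call for a spatial index such as a grid or k-d tree.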

Experiment

The evaluation validated the Atlas H&E-TME application, a four-stage AI pipeline (tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis) for single-cell tumor microenvironment profiling in H&E slides, against annotations from board-certified pathologists, using diverse multi-source datasets and scanner types across five cancer indications. The validation confirmed that the system reliably performs tissue quality control, segments seven tissue classes, detects and classifies cells into nine types, and derives spatial readouts such as cell densities and neighborhood co-occurrence statistics. The findings indicate coverage of at least 90% of invasive morphological subtypes per supported cancer, supporting the application's generalizability and practical utility for comprehensive spatial profiling in clinical research.

