HyperAI

DocBank Text Dataset

Date

3 years ago

Size

48.1 GB

Organization

Beijing University of Aeronautics and Astronautics

Publish URL

github.com

Categories

Featured Images

DocBank is a text dataset for document layout analysis. It contains 500,000 document pages with fine-grained, token-level annotations. The dataset was built in a simple and effective way, using weak supervision from the LaTeX sources of documents available on arXiv.

DocBank.torrent
Seeding: 1 | Downloading: 2 | Completed: 299 | Total Downloads: 613
  • DocBank/
    • README.md
      967 bytes
    • README.txt
      1.89 KB
    • data/
      • DocBank_500K_ori_img.zip.001
        5 GB
      • DocBank_500K_ori_img.zip.002
        10 GB
      • DocBank_500K_ori_img.zip.003
        15 GB
      • DocBank_500K_ori_img.zip.004
        20 GB
      • DocBank_500K_ori_img.zip.005
        25 GB
      • DocBank_500K_ori_img.zip.006
        30 GB
      • DocBank_500K_ori_img.zip.007
        35 GB
      • DocBank_500K_ori_img.zip.008
        40 GB
      • DocBank_500K_ori_img.zip.009
        45 GB
      • DocBank_500K_ori_img.zip.010
        47.41 GB
      • DocBank_500K_txt.zip
        47.9 GB
      • MSCOCO_Format_Annotation.zip
        48.1 GB
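
The image archive is listed as ten numbered parts (.001 to .010), so it has to be reassembled before it can be unzipped. Below is a minimal sketch of one way to do that. It assumes the parts are plain byte-wise splits of a single zip file, which the listing itself does not state, so check README.md first. The helper name `join_parts` and its parameters are hypothetical and are not part of the dataset.

```python
import shutil
from pathlib import Path


def join_parts(data_dir, stem="DocBank_500K_ori_img.zip"):
    """Concatenate <stem>.001, <stem>.002, ... into a single <stem> file.

    Assumption: the numbered parts are raw byte splits of one archive,
    so joining them in numeric order restores the original file.
    """
    data_dir = Path(data_dir)

    # Keep only files whose final suffix is purely numeric (.001, .002, ...).
    parts = [p for p in data_dir.glob(stem + ".*") if p.suffix[1:].isdigit()]
    if not parts:
        raise FileNotFoundError(f"no parts matching {stem}.NNN in {data_dir}")

    # Sort by the numeric value of the suffix, not by string order.
    parts.sort(key=lambda p: int(p.suffix[1:]))

    out_path = data_dir / stem
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(src, out)
    return out_path
```

Calling `join_parts("DocBank/data")` would write `DocBank/data/DocBank_500K_ori_img.zip` next to the parts. The joined file needs roughly as much free disk space again as the parts themselves.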