HyperAI

ProtT3 Protein Text Question Answering Dataset

Date

a year ago

Size

1.4 GB

Organization

Hokkaido University
National University of Singapore
University of Science and Technology of China

Publish URL

github.com

The ProtT3 dataset was jointly constructed in 2024 by research teams from the National University of Singapore, the University of Science and Technology of China, and Hokkaido University. It accompanies the paper "ProtT3: Protein-to-Text Generation for Text-based Protein Understanding", which was accepted to ACL 2024. This dataset is the pre-training dataset used in that paper.

The ProtT3 dataset consists of three datasets: Swiss-Prot, ProteinKG25 and PDB-QA.

Statistics of the protein text dataset

As shown in the table above, Swiss-Prot is a protein sequence database with text annotations. The researchers processed the dataset and excluded protein names from the text annotations to prevent information leakage. The resulting text descriptions concatenate the annotations of protein function, location, and family.
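The Swiss-Prot processing described above can be illustrated with a minimal sketch. The field names (`name`, `function`, `subcellular_location`, `family`) and the sample entry are hypothetical and chosen only for illustration; the actual preprocessing code is not shown on this page.

```python
# Annotation fields kept in the description. The protein name is deliberately
# not in this list, so it cannot leak into the generated text.
# These field names are assumptions, not the dataset's real schema.
DESCRIPTION_FIELDS = ("function", "subcellular_location", "family")


def build_description(entry):
    """Concatenate the function, location, and family annotations of one entry."""
    parts = [entry[field].strip() for field in DESCRIPTION_FIELDS if entry.get(field)]
    return " ".join(parts)


# Made-up entry used only to exercise the function.
sample_entry = {
    "name": "Example kinase 1",
    "function": "Catalyzes the phosphorylation of target proteins.",
    "subcellular_location": "Located in the cytoplasm.",
    "family": "Belongs to the protein kinase superfamily.",
}
description = build_description(sample_entry)
print(description)
```

Selecting the kept fields from an explicit list, rather than deleting the name afterwards, keeps the leakage guard easy to audit. A name that also appears inside the free-text annotations would need separate handling, which this sketch does not attempt.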

ProteinKG25 is a knowledge graph derived from the Gene Ontology database. The researchers first aggregated the triplets belonging to the same protein and then filled the protein's information into a predefined text template, converting the triplets into free text.
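The two steps described for ProteinKG25 (aggregate triplets per protein, then fill a template) can be sketched as follows. The template wording, the protein identifiers, and the sample triplets are hypothetical; the predefined template used by the researchers is not given on this page.

```python
from collections import defaultdict

# Hypothetical sentence template, not the one used by the dataset authors.
TEMPLATE = "{relation}: {objects}."


def triplets_to_text(triplets):
    """Group (protein, relation, object) triplets by protein and render one text per protein."""
    grouped = defaultdict(lambda: defaultdict(list))
    for protein, relation, obj in triplets:
        grouped[protein][relation].append(obj)

    texts = {}
    for protein, relations in grouped.items():
        sentences = [
            TEMPLATE.format(relation=relation, objects=", ".join(objects))
            for relation, objects in relations.items()
        ]
        texts[protein] = " ".join(sentences)
    return texts


# Made-up triplets with placeholder protein identifiers.
sample_triplets = [
    ("protein_A", "molecular function", "ATP binding"),
    ("protein_A", "cellular component", "cytoplasm"),
    ("protein_A", "molecular function", "kinase activity"),
    ("protein_B", "biological process", "DNA repair"),
]
texts = triplets_to_text(sample_triplets)
for protein, text in texts.items():
    print(protein, "->", text)
```

Grouping by relation before rendering means that several objects of the same relation share one sentence instead of producing repeated sentences.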

PDB-QA is a single-turn protein question-answering dataset derived from RCSB PDB. It contains 30 question templates about protein structure, properties, and supplementary information. As shown in the table below, for fine-grained evaluation the researchers divided the questions into 4 categories based on the format of the answer (string or number) and the content focus (structure/property or supplementary information).
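The 2 x 2 categorization can be sketched as below. This assumes that the content focus is a label attached to each question template and that the answer format can be inferred by trying to parse the answer as a number. Both are assumptions made for illustration, and the label strings are hypothetical.

```python
# Hypothetical labels for the two content-focus groups.
VALID_FOCUS = {"structure/property", "supplementary"}


def answer_format(answer):
    """Return "number" if the answer parses as a float, otherwise "string"."""
    try:
        float(answer)
        return "number"
    except ValueError:
        return "string"


def categorize(answer, focus):
    """Map one QA pair to one of the 4 categories as an (answer format, content focus) pair."""
    if focus not in VALID_FOCUS:
        raise ValueError(f"unknown content focus: {focus!r}")
    return (answer_format(answer), focus)


# Made-up answers used only to exercise the function.
print(categorize("4", "structure/property"))
print(categorize("X-ray diffraction", "supplementary"))
```

Note that `float` also accepts strings such as "nan" and "inf", so a real pipeline would probably assign the answer format per template rather than per answer.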

QA sample pairs in the PDB-QA dataset
ProtT3.torrent
  • ProtT3/
    • README.md
      2.13 KB
    • README.txt
      4.26 KB
    • data/
      • osfstorage-archive.zip
        1.4 GB