ProtT3 Protein Text Question Answering Dataset
The ProtT3 dataset was jointly constructed in 2024 by research teams from the National University of Singapore, the University of Science and Technology of China, and Hokkaido University. The related paper, "ProtT3: Protein-to-Text Generation for Text-based Protein Understanding", was accepted to ACL 2024. The dataset serves as the pre-training data for the research described in that paper.
The ProtT3 dataset consists of three datasets: Swiss-Prot, ProteinKG25 and PDB-QA.

As shown in the table above, Swiss-Prot is a protein sequence database with text annotations. The researchers processed the dataset and excluded the protein names from the text annotations to prevent information leakage. The generated text descriptions connect the annotations of protein function, location, and family.
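The preprocessing described above can be illustrated with a minimal sketch. The record layout and the field names (`name`, `function`, `subcellular_location`, `family`) are assumptions made for this example and are not the authors' actual schema; the protein name in the sample record is likewise invented.

```python
# Hypothetical annotation fields; the name field is deliberately not included,
# so the protein name cannot leak into the generated description.
DESCRIPTION_FIELDS = ("function", "subcellular_location", "family")


def build_description(entry):
    """Join the function, location, and family annotations into one text."""
    parts = [entry[field].strip() for field in DESCRIPTION_FIELDS if entry.get(field)]
    return " ".join(parts)


# Invented record, used only to show the shape of the input.
example_entry = {
    "name": "Hypothetical kinase X",
    "function": "Catalyzes the phosphorylation of target proteins.",
    "subcellular_location": "Located in the cytoplasm.",
    "family": "Belongs to the protein kinase superfamily.",
}

description = build_description(example_entry)
print(description)
```

Missing annotations are simply skipped, so a record with only a function annotation still yields a valid, shorter description.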
ProteinKG25 is a knowledge graph derived from the Gene Ontology database. The researchers first aggregated the triplets belonging to the same protein and then filled the protein's information into predefined text templates, converting its triplets into free text.
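The triplet-to-text conversion can be sketched as follows. This is an illustration only: the relation names, the template wording, and the protein identifiers `P1` and `P2` are hypothetical, since the actual templates are not given here.

```python
from collections import defaultdict

# Hypothetical templates keyed by relation name (assumed for illustration).
TEMPLATES = {
    "enables": "The protein enables {objects}.",
    "involved_in": "The protein is involved in {objects}.",
    "located_in": "The protein is located in {objects}.",
}


def triplets_to_text(triplets):
    """Group (protein, relation, object) triplets by protein, then fill templates."""
    grouped = defaultdict(lambda: defaultdict(list))
    for protein, relation, obj in triplets:
        grouped[protein][relation].append(obj)

    texts = {}
    for protein, relations in grouped.items():
        # One sentence per relation; objects sharing a relation are comma-joined.
        sentences = [
            TEMPLATES[relation].format(objects=", ".join(objects))
            for relation, objects in relations.items()
        ]
        texts[protein] = " ".join(sentences)
    return texts


# Invented triplets, used only to show the aggregation step.
sample_triplets = [
    ("P1", "enables", "ATP binding"),
    ("P1", "involved_in", "protein phosphorylation"),
    ("P1", "enables", "kinase activity"),
    ("P2", "located_in", "the nucleus"),
]

texts = triplets_to_text(sample_triplets)
for protein, text in texts.items():
    print(protein, "->", text)
```

Aggregating before templating means each protein receives a single description, with objects that share a relation merged into one sentence rather than repeated across several.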
PDB-QA is a single-turn protein question-answering dataset derived from RCSB PDB. It contains 30 question templates about protein structure, properties, and supplementary information. As shown in the table below, for fine-grained evaluation, the researchers divided the questions into 4 categories based on the format of the answer (string or number) and the content focus (structure/property or supplementary information).
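The two-axis split yields 2 x 2 = 4 categories. A minimal sketch of such a categorization follows; the focus labels, the numeric check, and the sample answers are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical labels for the content-focus axis.
FOCUS_LABELS = ("structure/property", "supplementary")


def answer_format(answer):
    """Return 'number' if the answer parses as a float, otherwise 'string'."""
    try:
        float(answer)
        return "number"
    except ValueError:
        return "string"


def categorize(answer, focus):
    """Map a QA pair to one of the four (format, focus) categories."""
    if focus not in FOCUS_LABELS:
        raise ValueError(f"unknown content focus: {focus!r}")
    return (answer_format(answer), focus)


# Invented answers, used only to exercise each branch.
print(categorize("2.5", "structure/property"))
print(categorize("X-ray diffraction", "structure/property"))
print(categorize("2015", "supplementary"))
print(categorize("Homo sapiens", "supplementary"))
```

Keeping the two axes separate makes it straightforward to report a metric per category, for example by grouping model predictions on the returned tuple.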
