UniRef50 Protein Sequence Dataset

Date

5 days ago

Publish URL

www.uniprot.org

The UniRef50 protein sequence dataset is drawn from the UniProt Knowledgebase (UniProtKB). The associated paper is "AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model".

The dataset is built from UniProtKB and selected UniParc sequences through iterative clustering (UniProtKB + UniParc → UniRef100 → UniRef90 → UniRef50). It contains 41,546,293 training sequences and 82,929 validation sequences. Each clustering stage merges sequences at a lower identity threshold than the one before it, so the resulting UniRef50 set is non-redundant and diverse, and it covers the protein sequence space broadly enough to train protein language models.
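
The clustering cascade described above can be illustrated with a toy sketch. This is not the UniProt pipeline: the function names (`similarity`, `greedy_cluster`, `cascade`) are hypothetical, the greedy representative selection is a simplification, and `difflib.SequenceMatcher.ratio()` is only a crude stand-in for alignment-based sequence identity. The thresholds (100%, 90%, 50%) follow the UniRef100, UniRef90, and UniRef50 levels.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for alignment-based sequence identity (assumption)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Return one representative per cluster.

    Sequences are visited longest first (ties broken alphabetically so the
    result is deterministic). A sequence that matches an existing
    representative at >= threshold is absorbed; otherwise it becomes a new
    representative.
    """
    reps = []
    for s in sorted(set(seqs), key=lambda x: (-len(x), x)):
        if not any(similarity(s, r) >= threshold for r in reps):
            reps.append(s)
    return reps


def cascade(seqs):
    """Toy version of the UniRef100 -> UniRef90 -> UniRef50 cascade."""
    level100 = greedy_cluster(seqs, 1.0)      # merge identical sequences
    level90 = greedy_cluster(level100, 0.9)   # merge near-duplicates
    level50 = greedy_cluster(level90, 0.5)    # merge more distant relatives
    return level100, level90, level50


# Made-up sequences for illustration only (not real UniProt entries).
toy = [
    "MKTAYIAKQRQISFVKSHFS",
    "MKTAYIAKQRQISFVKSHFS",  # exact duplicate
    "MKTAYIAKQRQISFVKSHFT",  # one substitution
    "MKTAYIAKQRQIWWWWWWWW",  # shares only the first 12 residues
    "GGGGSGGGGSGGGGSGGGGS",  # unrelated
]
u100, u90, u50 = cascade(toy)
print(len(u100), len(u90), len(u50))
```

Each stage takes the previous stage's representatives as input, which is why the set shrinks monotonically. A production pipeline would replace `similarity` with a proper alignment-based identity measure and would not compare every sequence against every representative.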