Date

a year ago

Publish URL

www.uniprot.org

The UniRef50 protein sequence dataset comes from the UniProt Knowledgebase; the associated paper is "AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model". UniRef50 is built from UniProtKB sequences and selected UniParc sequences through iterative clustering (UniProtKB + UniParc → UniRef100 → UniRef90 → UniRef50). This release contains 41,546,293 training sequences and 82,929 validation sequences. Because each clustering stage merges similar sequences, the resulting set is high-quality, non-redundant, and diverse, and it gives protein language models broad coverage of protein sequence space.
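The iterative reduction described above can be illustrated with a toy sketch. This is not the UniRef production pipeline, which relies on dedicated clustering tools and alignment-based identity. Here `similarity`, `greedy_cluster`, and `iterative_reduce` are hypothetical helpers, `difflib.SequenceMatcher` is a crude stand-in for sequence identity, and the thresholds 1.0, 0.9, and 0.5 simply mirror the level names. The example sequences are made up.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption: not a real alignment)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: list[str], threshold: float) -> list[str]:
    """Return one representative per cluster.

    Sequences are visited longest first. A sequence that matches an existing
    representative at or above `threshold` is absorbed into that cluster;
    otherwise it becomes a new representative.
    """
    reps: list[str] = []
    for s in sorted(seqs, key=len, reverse=True):
        if not any(similarity(s, r) >= threshold for r in reps):
            reps.append(s)
    return reps


def iterative_reduce(seqs: list[str], thresholds=(1.0, 0.9, 0.5)) -> list[list[str]]:
    """Cluster repeatedly, feeding each level's representatives into the next."""
    levels = []
    current = list(seqs)
    for t in thresholds:
        current = greedy_cluster(current, t)
        levels.append(current)
    return levels


# Made-up toy sequences, all 20 residues long.
toy = [
    "MKTAYIAKQRQISFVKSHFS",  # seed
    "MKTAYIAKQRQISFVKSHFS",  # exact duplicate of the seed
    "MKTAYIAKQRQISFVKSHFT",  # one substitution
    "MKTAYIAKQRQIWWWWWWWW",  # shares only the first 12 residues
    "GGGGGPPPPPGGGGGPPPPP",  # unrelated
]

levels = iterative_reduce(toy)
print([len(level) for level in levels])  # [4, 3, 2]
```

Each level shrinks the set: the exact duplicate is dropped first, then the near-identical variant, then the sequence sharing about half of the seed. Feeding representatives forward, rather than re-clustering the full input at each threshold, is what makes the process iterative.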

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.