HyperAI

Learn About the PDB Protein Structure Dataset Behind AlphaFold 2 in One Article

4 years ago
Big factory news
Headlines
神经小兮
特色图像

DeepMind's latest generation algorithm, AlphaFold 2, has completely defeated other competitors in CASP, which is recently known as the "Protein Olympics", and has achieved amazing breakthroughs, shocking the entire scientific research community. After being overwhelmed by this scientific research achievement, let's take a look at the data set behind the algorithm.

In the past two days, we have been bombarded with news about DeepMind's new generation artificial intelligence algorithm AlphaFold 2, especially in the biological world, which can be said to have ushered in a milestone breakthrough.

According to DeepMind's official announcement, its deep learning algorithm AlphaFold 2 has successfully solved major problems in the biological field over the past 50 years.

The algorithm can accurately predict the 3D structure of proteins based on their amino acid sequences, with an accuracy comparable to 3D structures solved using experimental techniques such as cryo-electron microscopy (CryoEM), nuclear magnetic resonance or X-ray crystallography.

Examples of two target proteins in the CASP competition's free modeling category,Green is the experimental observation result, blue is the AlphaFold 2 prediction result

This milestone event excited biologists, but it also made many people in the industry tremble with fear, and they said they would switch careers to study deep learning.

However, while everyone is paying attention to this scientific research result, don’t forget the hero behind it—— PDB protein structure data set, a data set dedicated to collecting three-dimensional structure data of proteins and nucleic acids.

This data set is essential for a groundbreaking breakthrough

According to DeepMind, the team trained the system on public data.These data come from the protein structure dataset PDB and the large database UniProt containing protein sequences of unknown structures, which together include about 170,000 protein structures.

in,PDB is a data set dedicated to the three-dimensional structure of proteins and nucleic acids. It has a very long history, dating back to 1971.

Walter Hamilton of Brookhaven National Laboratory in the United States decided to establish this database. In October 1998, PDB was handed over to the Research Collaboratory for Structural Bioinformatics (RCSB), which was managed by Helen M. Berman of Rutgers University, who was also a member of RCSB.

Brookhaven National Laboratory was established in 1947,Its scientific discoveries have won the Nobel Prize five times.

In 2003,The PDB evolved into an international organization, wwPDB (Worldwide Protein Database), to oversee the PDB resources. Other members of wwPDB, including PDBe (Europe), RCSB (USA), and PDBj (Japan), also provide centers for data accumulation, processing, and publication for PDB.

The resources of wwPDB are open to the public free of charge worldwide.

It is worth mentioning that although the data in PDB are submitted by scientists from all over the world, each submitted data will be reviewed and annotated by wwPDB staff to check whether the data is reasonable. PDB and the software it provides are now open to the public free of charge.

More than 140,000 structures, what information is in the PDB?

Over the past few decades, the number of structures in the PDB has grown at a near exponential rate:

  • 100 in 1982;
  • 1,000 in 1993;
  • 10,000 in 1999;
  • 100,000 in 2014.

However, since 2007, the rate at which new protein structures are accumulated appears to have stabilized.

The number of new structures in the PDB dataset increases year by year, and the scale is huge

Structural biologists around the world use methods such as X-ray crystallography, NMR spectroscopy and cryo-electron microscopy to determine the position of each atom in a molecule relative to each other. They then submit this structural information, which is annotated and publicly released to the wwPDB.

You can search for structures of ribosomes, oncogenes, drug targets, and even entire viruses in the PDB dataset.However, the sheer number of structures archived in the PDB makes finding the information you need potentially a daunting task.

The information in the PDB dataset mainly includes:The source of protein/nucleic acid, the composition of protein/nucleic acid molecules, atomic coordinates, experimental methods used to determine the structure, as well as other data and information such as temperature factors and structure determiners.

A preview of the dataset, which contains information such as the protein/nucleic acid structure and the experimental methods used to determine the structure.

 How to download? 

The dataset is now available on the official website of Hyperneuron and openbayes.com. Visit:https://orion.hyper.ai/datasets/13906  Or click "Read original text" to get the data set with one click.

■ Details of PDB protein structure dataset

Release time:Collected since 1971

Publishing Agency:wwPDB

Quantity included:140,000+ protein/nucleic acid structures

Data format:csv file

Data size:27 MB (146 MB after decompression)

Download address:https://orion.hyper.ai/datasets/13906

The same dataset as DeepMind, you deserve it too~


 How to use? 

Our partner OpenBayes is a cloud service that provides cloud computing power for machine learning. They have a large-scale supercomputing cluster, and the GPU cluster architecture is designed specifically for matrix computing. It provides computing power containers for AI applications, and it is very easy to use and can be used out of the box.

Currently, OpenBayes' computing power container products already support TensorFlow, PyTorch, MXNet and other CPU and GPU environments, different versions and types of standard machine learning frameworks and various common dependencies.

Currently, the OpenBayes computing container supports standard librariesand provide CPU, NVIDIA T4, NVIDIA Tesla V100 and other computing resourcesWhether it is centralized training of massive data or low-power model resident operation, it can easily meet user needs.

From CPU to T4 to V100, a wide range of computing container configurations OpenBayes supportScript upload and JupyterLab editorOnline programming and then model training.

Clear and concise execution processFull tutorial: https://openbayes.com/docs/quickstart/
 Register as a new user to enjoy GPU computing power 

Visit openbayes.com, click on the official website to register immediately, and there will be free gifts every week during the internal test period, so you don’t have to compete with classmates and colleagues for computing power~

The dataset can be used/downloaded directly from public resources

 Event Description Visit openbayes.com Register as a new user with the invitation code [HyperAI]You can enjoy

Free CPU quota:300 minutes/week

Free vGPU quota:180 minutes/week

PDB full dataset acquisition:

https://www.rcsb.org/#Category-download

The files in the PDB dataset can be viewed directly with a text editor, but it is better to use a visualization tool. The officially recommended viewing program is Swiss PDB viewer:

https://spdbv.vital-it.ch/disclaim.html#

Other references:

http://www.wwpdb.org/

https://www.novopro.cn/articles/201912021193.html

-- over--