a year ago

State-aware protein-ligand complex prediction using AlphaFold3 with purified sequences

Enming Xing Junjie Zhang Shen Wang Xiaolin Cheng

Abstract

Deep learning-based prediction of protein-ligand complexes has advanced significantly with the development of architectures such as AlphaFold3, Boltz-1, Chai-1, Protenix, and NeuralPlexer. Multiple sequence alignment (MSA) has been a key input, providing coevolutionary information critical for structural inference. However, recent benchmarks reveal a major limitation: these models often memorize ligand poses from training data and perform poorly on novel chemotypes or dynamic binding events involving substantial conformational changes in binding pockets. To overcome this, we introduced a state-aware protein-ligand prediction strategy leveraging purified sequence subsets generated by AF-ClaSeq—a method previously developed by our group. AF-ClaSeq isolates coevolutionary signals and selects sequences that preferentially encode distinct structural states as predicted by AlphaFold2. By applying MSA-derived conformational restraints, we observed significant improvements in predicting ligand poses. In cases where AlphaFold3 previously failed—producing incorrect ligand placements and associated protein conformations—we were able to correct the predictions by using sequence subsets corresponding to the relevant functional state, such as the inactive form of an enzyme bound to a negative allosteric modulator.

One-sentence Summary

This study introduces a state-aware protein-ligand complex prediction strategy that integrates AlphaFold3 with AF-ClaSeq purified sequence subsets to isolate coevolutionary signals for distinct structural states, applying MSA-derived conformational restraints to overcome ligand pose memorization and correct AlphaFold3 prediction failures in dynamic binding events.

Key Contributions

  • A state-aware protein-ligand prediction strategy integrates AF-ClaSeq-derived purified sequence subsets into deep learning structure prediction pipelines. This approach filters multiple sequence alignments to retain only sequences encoding distinct structural conformations, applying evolutionary restraints that guide folding algorithms toward ligand-compatible states.
  • Accurate conformational sampling depends on sequence purity rather than alignment depth, with state-specific evolutionary signals distributed across diverse phylogenetic clades. Selecting sequences by purity therefore sidesteps the signal averaging that attention-based transformers exhibit when processing heterogeneous evolutionary data.
  • Incorporating purified sequence subsets into the AlphaFold3 framework produces more accurate ligand placements and binding pocket geometries than default predictions. The method corrects prior modeling failures on targets requiring substantial conformational rearrangements or novel chemotype binding.

Introduction

Accurate prediction of protein-ligand complexes is a cornerstone of computational drug discovery and molecular modeling, with recent deep learning architectures like AlphaFold3 achieving breakthrough accuracy by leveraging coevolutionary signals from multiple sequence alignments. Despite these advances, current models frequently struggle with novel chemotypes and fail to capture binding pocket plasticity, often defaulting to a single static conformation or relying on memorized structural patterns from training data. To overcome these barriers, the authors leverage a sequence purification framework called AF-ClaSeq to isolate evolutionary subsets that specifically encode distinct protein conformations. By integrating these state-specific sequence subsets into AlphaFold3, they introduce conformational restraints that guide the model toward functionally relevant structural states, significantly improving ligand pose accuracy and enabling reliable predictions for dynamic binding events.

Method

The authors leverage a sequence purification strategy to improve the prediction of protein-ligand complexes involving allosteric inhibitors, particularly for proteins like EGFR and IL-1β where the target conformation is distinct from the default state predicted by AlphaFold3 (AF3). The core of the method involves biasing the multiple sequence alignment (MSA) used by AF3 to favor a specific protein conformation, thereby guiding the model toward the desired structural state. The process begins with a deep MSA generated from a large pool of homologous sequences. For EGFR, the initial MSA of 49,743 sequences was filtered to 43,365 using a coverage threshold. To assess the conformational bias encoded within this MSA, the authors performed a series of M-fold sampling experiments, predicting structures from randomly shuffled groups of sequences. The initial analysis revealed a strong bias toward the active state, which was attributed to the model's tendency to predict well-represented conformations.
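The first two steps above (coverage filtering, then random grouping for M-fold sampling) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the definition of coverage as the non-gap fraction of an aligned row, and the 0.8 default threshold are all assumptions, since the summary does not state the threshold that reduced 49,743 sequences to 43,365.

```python
import random

def coverage(aligned_seq):
    """Fraction of alignment columns that are not gaps (assumed definition)."""
    if not aligned_seq:
        return 0.0
    non_gap = sum(1 for c in aligned_seq if c not in "-.")
    return non_gap / len(aligned_seq)

def filter_by_coverage(msa, min_coverage=0.8):
    """Keep (name, aligned_sequence) rows meeting the coverage cutoff.

    The 0.8 default is a placeholder; the source does not report the value.
    """
    return [(name, seq) for name, seq in msa if coverage(seq) >= min_coverage]

def shuffle_into_groups(msa, group_size, seed=0):
    """Randomly shuffle the rows and split them into consecutive groups.

    Each group would then be passed to the structure predictor as its own
    shallow MSA; repeating with different seeds gives repeated samplings.
    """
    rng = random.Random(seed)
    shuffled = list(msa)
    rng.shuffle(shuffled)
    return [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]

# Toy alignment: query plus three homologs of varying coverage.
toy_msa = [("query", "ACDE"), ("a", "A--E"), ("b", "ACD-"), ("c", "----")]
kept = filter_by_coverage(toy_msa, min_coverage=0.75)
groups = shuffle_into_groups(toy_msa, group_size=3, seed=1)
```

Fixing the seed makes a given shuffle reproducible, which helps when tracing a predicted structure back to the sequences that produced it.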

To correct this bias, an iterative enrichment procedure was applied. The method uses the root-mean-square deviation (RMSD) of key structural elements—the αC helix and activation loop (A-Loop)—relative to a reference structure of the desired inactive state (PDB 2GS7) as a metric. Sequences are grouped into sets of six, and after prediction, the sequences yielding structures with the lowest RMSD to the inactive state are selected for the next iteration. This process was repeated four times, progressively enriching the sequence pool for those encoding the inactive conformation. The enriched sequence set from the final iteration was then used for M-fold sampling, generating a large number of predictions to map the conformational landscape. The results, visualized in a scatter plot, show a distribution of predicted structures along a conformational axis, with the local pLDDT score indicating the confidence of the predicted secondary structures.
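One enrichment round as described above might look like the sketch below. It assumes the predicted and reference coordinates of the αC helix and A-Loop atoms have already been superposed on a common frame, so no fitting is done here, and it assumes a fixed fraction of the best-scoring groups is kept per round. The summary does not give the selection cutoff, so `keep_fraction` and both function names are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    No superposition is performed; inputs are assumed pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    sq_sum = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(sq_sum / len(coords_a))

def enrich_once(groups, group_rmsds, keep_fraction=0.5):
    """Return the sequences from the groups closest to the reference state.

    groups       -- list of sequence groups (e.g. six sequences each)
    group_rmsds  -- RMSD of each group's predicted structure to the reference
    keep_fraction -- hypothetical cutoff; not reported in the source
    """
    ranked = sorted(zip(group_rmsds, range(len(groups))))
    n_keep = max(1, int(len(groups) * keep_fraction))
    kept = []
    for _, idx in ranked[:n_keep]:
        kept.extend(groups[idx])
    return kept

# Toy example: four groups, two of which predict near the reference state.
toy_groups = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
toy_rmsds = [3.0, 1.0, 2.0, 4.0]
next_pool = enrich_once(toy_groups, toy_rmsds, keep_fraction=0.5)
```

Running the round again on `next_pool` (after regrouping and re-predicting) would give the repeated enrichment the text describes.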

To identify the most strongly biased sequences, a voting mechanism was employed. The prediction results were binned based on their RMSD to the active and inactive states. A "normal voting" scheme selected sequences that were most frequent in a given bin. A more stringent "enforced voting" scheme required not only the highest frequency but also that the frequency exceeded a threshold of 0.15, isolating sequences with a very strong conformational preference. This process identified a subset of sequences, designated as "pure sequences," that were highly biased toward the inactive state. These purified sequences were used to generate a final MSA for AF3 predictions of the protein-ligand complex, with the ligand's SMILES provided as input.
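The two voting schemes can be illustrated with a small sketch. The interpretation here is an assumption: each sequence takes part in many predictions, each prediction falls into one RMSD bin, and a sequence is assigned to the bin it lands in most often. Enforced voting then also requires that the fraction of the sequence's predictions in that bin exceed the threshold (0.15 in the text). The function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict

def vote(observations, min_fraction=None):
    """Assign each sequence to its most frequent bin.

    observations -- iterable of (sequence_id, bin_id) pairs, one per
                    prediction the sequence took part in
    min_fraction -- None for "normal" voting; a float (e.g. 0.15) for
                    "enforced" voting, which drops sequences whose top-bin
                    fraction does not exceed it
    """
    per_seq = defaultdict(Counter)
    for seq_id, bin_id in observations:
        per_seq[seq_id][bin_id] += 1

    assigned = {}
    for seq_id, counts in per_seq.items():
        total = sum(counts.values())
        top_bin, top_count = counts.most_common(1)[0]
        if min_fraction is None or top_count / total > min_fraction:
            assigned[seq_id] = top_bin
    return assigned

# s1 lands in bin 0 in 8 of 10 predictions; s2 is spread evenly over 10 bins.
toy_obs = [("s1", 0)] * 8 + [("s1", 1)] * 2 + [("s2", b) for b in range(10)]
normal = vote(toy_obs)
enforced = vote(toy_obs, min_fraction=0.15)
```

In this toy case enforced voting keeps only `s1`, because no bin holds more than 10% of the predictions involving `s2`.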

The method was validated on the EGFR-4 allosteric inhibitor complexes. Default AF3 predictions were highly divergent and failed to reproduce the correct ligand pose, whereas predictions using the purified inactive-state sequences were markedly more accurate and consistent: the ligand RMSD dropped to near or below 2.5 Å, and the ligand atomic pLDDT scores increased significantly, indicating a reliable prediction of the binding mode. The same strategy was applied to the IL-1β/ligand system, where the key conformational change is the displacement of the β4-5 loop. Iterative enrichment based on RMSD to the displaced-loop state produced an enriched sequence set, and predictions using the top 20 most frequent sequences from the target conformational region achieved perfect alignment with the experimental structure, demonstrating the power of conformational restraints derived from the MSA.
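For readers who want to try a purified MSA with a local AlphaFold3 installation, the sketch below assembles a job in the style of AlphaFold3's public JSON input format. The summary does not describe how the authors prepared their inputs, so treat this as one possible setup. The field names follow that format as commonly documented and should be checked against the installed version. The sequences and SMILES string are placeholders rather than EGFR or IL-1β data, and the helper names are hypothetical.

```python
import json

def to_a3m(query_seq, purified):
    """Build an A3M string with the query as the first record.

    purified -- list of (name, aligned_sequence) pairs, assumed to be
                already aligned to the query in A3M convention
    """
    lines = [">query", query_seq]
    for name, aligned in purified:
        lines += [f">{name}", aligned]
    return "\n".join(lines) + "\n"

def build_af3_input(job_name, query_seq, purified, ligand_smiles, seeds=(1,)):
    """Return a dict for one protein chain plus one SMILES-defined ligand."""
    return {
        "name": job_name,
        "modelSeeds": list(seeds),
        "sequences": [
            {
                "protein": {
                    "id": "A",
                    "sequence": query_seq,
                    # Custom MSA built from the purified subset only.
                    "unpairedMsa": to_a3m(query_seq, purified),
                    # Empty values so no paired MSA or templates are added
                    # (assumed to be the intended setup for a custom MSA).
                    "pairedMsa": "",
                    "templates": [],
                }
            },
            {"ligand": {"id": "L", "smiles": ligand_smiles}},
        ],
        "dialect": "alphafold3",
        "version": 1,
    }

# Placeholder inputs for illustration only.
job = build_af3_input(
    job_name="purified_msa_demo",
    query_seq="ACDEFGHIK",
    purified=[("pure_1", "ACDEFGHIK"), ("pure_2", "ACD-FGHLK")],
    ligand_smiles="CCO",
)
job_json = json.dumps(job, indent=2)
```

Writing `job_json` to a file yields an input that can be passed to the prediction script. Several seeds would typically be listed in `modelSeeds` to check how consistent the ligand pose is across runs.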

Experiment

The study evaluated two allosteric inhibitor systems, EGFR mutants and IL-1β cryptic pocket antagonists, by comparing default AlphaFold3 predictions against those guided by iterative sequence purification based on generalized conformational references. While default models consistently failed to capture the necessary allosteric or cryptic binding conformations and tended to revert to memorized training patterns, the purification approach successfully biased multiple sequence alignments toward functionally relevant inactive or allosteric states. Consequently, the refined predictions closely matched experimental crystal structures for both protein backbones and ligand poses, demonstrating that conformational constraint-guided sequence enrichment provides a broadly applicable framework for accurately modeling novel allosteric drug-target interactions.

