SandboxAQ unveils an open dataset for drug discovery

Tech in Asia·2025-06-19 17:00

SandboxAQ has introduced SAIR (Structurally Augmented IC50 Repository), a dataset containing protein-ligand pairs with experimental potency data.

The company describes this dataset as the largest of its kind, aimed at helping researchers enhance AI models for drug discovery by improving predictions of protein-ligand binding affinities.

SAIR includes around 5.2 million synthetic 3D structures across more than 1 million protein-ligand systems.

It was developed using Nvidia’s DGX Cloud platform along with SandboxAQ’s AI Large Quantitative Model (LQM) capabilities.

The project reportedly achieved a twofold improvement in GPU utilization through collaboration with Nvidia.

The dataset integrates physics-based modeling and AI to boost reliability and speed in predicting molecular interactions.

🔗 Source: SandboxAQ

🧠 Food for thought

1️⃣ From hundreds to millions: the exponential evolution of binding data resources

The SAIR dataset represents a significant leap in the scale of protein-ligand binding data available to researchers, fundamentally changing what’s possible in computational drug discovery.

In 2006, BindingDB—then considered comprehensive—contained only about 20,000 experimentally determined binding affinities for approximately 11,000 small molecule ligands and 110 protein targets 1.

At 5.2 million synthetic structures, the new dataset is roughly 260 times the size of that 2006 collection of about 20,000 measured affinities, although the two count different things: computed 3D structures versus experimental measurements. The added scale addresses what has long been a critical bottleneck in developing reliable AI models for drug discovery.

The transition from painstakingly collected experimental data to synthetically generated structures at massive scale parallels similar transformations in computer vision and natural language processing, where synthetic data augmentation enabled breakthrough performance.

The integration of binding data with structural repositories has been evolving since at least 2010, when RCSB PDB began incorporating binding constants like IC50 and Ki directly into their structure summaries, showing the field’s consistent movement toward more accessible and comprehensive datasets 2.

2️⃣ AI drug discovery moves from hype to validation with quantifiable benchmarks

The SAIR dataset provides a standardized benchmark that could accelerate progress across the entire AI drug discovery sector, which has been growing rapidly but often lacks comparable performance metrics.

Traditional drug development typically costs between $1.3 billion and $4 billion and takes 10 to 15 years to complete, creating enormous pressure to develop more efficient discovery methods 3.

SandboxAQ claims that models trained on SAIR can deliver predictions at least 1,000 times faster than traditional physics-based methods. If independently confirmed, that would be a concrete, measurable advance in an industry often characterized by difficult-to-verify performance claims.

This acceleration aligns with broader industry trends, as companies like Insilico Medicine have advanced AI-discovered drug candidates into Phase 2 clinical trials for idiopathic pulmonary fibrosis, demonstrating real-world validation of AI approaches 4.

The collaboration with Nvidia to achieve a twofold improvement in GPU utilization highlights the critical role of specialized computing infrastructure in enabling these advances, showing that progress depends on both algorithmic and hardware innovations.

SandboxAQ’s strategic partnerships with research institutions like UCSF’s Institute of Neurodegenerative Diseases and organizations like the Michael J. Fox Foundation exemplify how AI drug discovery companies are increasingly validating their approaches through collaborations with established biomedical research entities 5.

3️⃣ Hybrid AI-physics approaches emerge as the winning paradigm

The SAIR dataset represents a distinctive approach that combines AI with physics-based modeling, addressing limitations of purely data-driven or purely physics-based methods.

While many AI drug discovery efforts focus exclusively on machine learning, SandboxAQ’s Large Quantitative Models (LQMs) integrate AI with physics-based methods, reflecting an emerging consensus that hybrid approaches deliver superior results 5.

This integration has already shown practical benefits, with SandboxAQ reporting a 30-fold increase in hit rates for candidate molecules compared to traditional methods, directly addressing the high failure rate that plagues conventional drug discovery 5.

By making SAIR publicly available, SandboxAQ is enabling broader scientific adoption of hybrid modeling approaches that leverage both the speed of AI and the mechanistic understanding provided by physics-based simulations.

Through its AQBioSim division, the company focuses on accelerating small molecule and biologics development throughout the drug discovery lifecycle, addressing computational bottlenecks that have historically limited virtual screening and candidate optimization 5.

……
