News | Mar 30, 2026 | via GEN
Why Data Infrastructure Determines AI Success in Drug Discovery
The drug discovery industry's investment in and reliance on AI have reached unprecedented levels over the last five years. Amid this excitement, one fundamental question is often overlooked: what happens when sophisticated AI models encounter the messy reality of laboratory data?

AI is becoming inseparable from modern drug discovery, with machine learning models applied to quantitative structure–activity relationship (QSAR) analysis, target identification, lead optimization, and even regulatory strategy. Despite impressive algorithmic advances, many AI efforts in life sciences underperform or fail outright. AI model sophistication is not to blame; rather, data quality, structure, and governance are.

Here we examine the infrastructure requirements for AI-ready drug discovery data, focusing on three interconnected challenges: capturing molecular complexity in machine-readable formats, implementing FAIR (Findable, Accessible, Interoperable, and Reusable) data principles at scale, and preparing datasets through collaboration-centric platforms. Together, these provide the foundation that allows AI to function reliably in real-world R&D environments.

The data quality imperative

Fay Lin's December 2025 GEN interview with Najat Khan, PhD, highlighted Recursion's next chapter, demonstrating that AI-driven drug discovery can consistently deliver clinical value. Clinical rigor relies on superior preclinical execution. AI-empowered drug discovery requires platforms that support the Design-Make-Test-Analyze (DMTA) cycle, the fundamental, iterative four-stage workflow by which pharmaceutical R&D progresses therapeutic intellectual property into drugs.

Computational drug discovery has long relied on the principle of chemical awareness: the understanding that a molecule's specific chemical properties are the primary drivers of its biological function and mechanism of action. This focus established a standard in which platforms were optimized for small molecules, representing their structures as discrete graphs of atoms and bonds to facilitate high-throughput property prediction and lead optimization.

Machine learning models learn from data, and their effectiveness depends entirely on the quality of that input. Feed them clean, well-structured data, and they can identify subtle SAR patterns that complement human intuition. Feed them inconsistent, incomplete, ambiguous, or poorly formatted data, and they will learn the noise along with the signal, yielding confident but less reliable predictions. In drug discovery, these weaknesses are amplified by systemic issues, including:

- Heterogeneous assay formats and naming conventions
- Incomplete or ambiguous chemical representations
- Poor linkage between biological results and chemical context
- Loss of provenance during collaboration, outsourcing, staff turnover, reorganizations, or M&A

The challenge multiplies for today's biological therapeutics. Antibody-drug conjugates (ADCs), peptide therapeutics, and oligonucleotide drugs present additional representational complexity that traditional small molecule databases were not designed to handle.

Atomic-level representation of biologics

For decades, the pharma industry treated small molecules and biologics as separate domains requiring different informatics approaches. Small molecules were represented atom-by-atom in formats like SMILES or MOL files. Biologics—proteins, peptides, oligonucleotides—were typically reduced to sequence notation, i.e., strings of letters representing amino acids or nucleotides.
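To make the gap concrete, consider the following minimal Python sketch (using the open-source RDKit toolkit; the molecules are illustrative examples, not drawn from any particular program): a SMILES string carries complete atomic connectivity, while a one-letter sequence carries none.

```python
# Minimal sketch contrasting the two representational worlds.
# Assumes RDKit is installed (pip install rdkit); molecules are illustrative.
from rdkit import Chem

# Small molecule: every atom and bond is explicit in the SMILES string.
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin.GetNumAtoms())        # 13 heavy atoms
print(Chem.MolToMolBlock(aspirin))  # full atomic connectivity as a MOL block

# Biologic: a peptide reduced to one-letter sequence notation. The string
# says nothing about a thioether bond to a linker, a cyclization, or a
# non-natural side chain, which is exactly the information an ADC requires.
peptide = "ACDEFGHIK"
print(len(peptide))                 # 9 residues, zero atomic detail
```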
This bifurcation creates problems as therapeutic modalities converge. ADCs combine a monoclonal antibody with a payload, typically a cytotoxic small molecule, connected through a chemical linker. The linker's attachment chemistry, the payload's structure, and the drug-to-antibody ratio all influence pharmacological properties. Sequence notation alone cannot capture this information.

A more comprehensive approach represents biologics at atomic resolution while maintaining the familiar sequence-level view that biologists prefer (e.g., hybrid schemas, HELM). "Chemically Aware Biologics" informatics means capturing not just "lysine" but the complete atomic connectivity of each modified amino acid. When a cysteine residue forms a thioether bond with a linker-payload, that bond is captured explicitly rather than being lost to a footnote.

Practical implementation requires databases that understand both chemical structure and biological sequence simultaneously—a biologic possessing "chemical awareness." When a researcher queries "show me all ADCs with this payload attached via maleimide chemistry," the system must parse both the small molecule structure and the conjugation site on the antibody.

This atomic-level approach extends to other modalities. Synthetic peptides with non-natural amino acids, chemically modified oligonucleotides with phosphorothioate backbones, and cyclic peptides with non-standard connectivity all benefit from representations that capture complete molecular structure rather than approximating it through sequence annotation.

Figure 1. Monomer SCSR—Observing monomer structures one at a time is a good solution for many use cases, but there are occasions when you want to see the whole chemical structure. This is particularly relevant for oligomers.

Figure 2. Antibody registration—Within this database management platform, the first step in constructing the antibody (a) is to pull out the hinge regions on the heavy chains and line them up, keeping the scale proportional. If further domains are indicated, segments are divided and annotated (b). Inter-chain linkages are drawn (c), as are any indicated disulfide bonds within individual chains (d). Finally, the identity of each amino acid is indicated using a color "bar code" style (e).

Figure 3. Antibody (left) and ADC (right) renderings depicting chemically aware sequences.

Implementing FAIR principles at industrial scale

The FAIR data principles have become a rallying cry across scientific disciplines. In drug discovery, implementing these principles requires more than philosophical commitment; it calls for specific technical infrastructure.

Findability begins with better metadata annotation, which in turn requires tools that lower the barrier to capturing data provenance and context. Experiments are more valuable with standardized descriptors that specify what assay was run, what target was screened, and what conditions were used. Without consistent metadata, valuable datasets become discoverable only by researchers who happen to know they exist.

Ontologies provide a formal, computer-readable vocabulary for annotation. An ontology defines standard terms and their relationships: "IC50" is a type of "dose-response parameter," which is a type of "potency measurement." When researchers use the same ontology to annotate their assays, datasets become searchable across projects and therapeutic modalities, as the sketch below illustrates. Implementation is the challenge.
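Here is a minimal Python sketch of what ontology-backed annotation can look like in practice; the three-term vocabulary and assay record are hypothetical, and production systems would draw on published resources such as the BioAssay Ontology:

```python
# Hypothetical three-term vocabulary mapping each term to its parent term;
# a real system would load a published ontology rather than a dict.
ONTOLOGY = {
    "IC50": "dose-response parameter",
    "dose-response parameter": "potency measurement",
    "potency measurement": None,  # root term
}

def is_a(term, ancestor):
    """Walk the parent chain to test whether term is a kind of ancestor."""
    while term is not None:
        if term == ancestor:
            return True
        term = ONTOLOGY.get(term)
    return False

# Annotating an assay with controlled terms makes it findable later.
assay = {"target": "EGFR", "readout": "IC50", "units": "nM"}

# Cross-project query: collect potency data regardless of the exact readout.
if is_a(assay["readout"], "potency measurement"):
    print("assay qualifies for the potency training set")
```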
Scientists focused on advancing their programs have little time to spend on data annotation. Effective systems must make FAIR practices easier to adopt: suggesting appropriate ontology terms based on experimental context, flagging inconsistent annotations before they propagate, and providing immediate benefits (e.g., better searchability, easier report generation) that justify the modest extra effort.

Interoperability requires common data formats and exchange standards. When a computational chemist wants to build a machine learning model using data from multiple internal projects, those datasets must be structurally compatible. Field names, units, and chemical representations should align without extensive manual harmonization. Annotation also provides the opportunity to quantitatively define the similarity of related assays.

This is particularly important when researchers collaborate, as during mergers and acquisitions. Organizations must integrate decades of research data from different informatics systems. Companies that maintain rigorous data standards can migrate and combine datasets relatively smoothly. They are often also better able to take on new regulatory requirements, e.g., in vitro pharmacology (IVP) and new non-animal methods (NAM).

Regulatory-ready data architecture

Drug discovery data serves multiple disciplines. Its architecture must support scientific and management decision-making, experimental SAR optimization, and AI model training, and ultimately it must withstand regulatory scrutiny.

Regulatory submissions increasingly emphasize data integrity and traceability. Auditors seek more than final results. They look for the complete chain of evidence: raw instrument data, processing steps, analytical decisions, and final interpretations. Systems designed only to train AI may lack the audit trails that regulators require.

The solution is to build regulatory compliance into the data infrastructure from the start. That means capturing provenance for every data point, maintaining version control for analytical methods, and implementing access controls that demonstrate data integrity. It is no coincidence that regulatory requirements align with AI best practices: both demand consistent data formatting, clear documentation of experimental methods, and traceability of analytical decisions. Organizations that build for regulatory compliance will also find their data well-suited for machine learning, and vice versa.

Bridging computational and experimental workflows

The most sophisticated data infrastructure provides little value if it creates friction for bench scientists. Researchers will often seek workarounds to avoid systems that slow their work, resorting to maintaining their own spreadsheets and local "databases" that fragment organizational knowledge.

Effective implementation requires meeting scientists where they work. This translates to browser-based interfaces that require no software installation, integrated platforms that address scientific workflows across functions and modalities, and APIs that connect with computational chemistry workflows. The goal is to make good data practices the path of least resistance. Perhaps the most important requirement is software that experimental scientists can use naturally within their existing workflows.

Emerging AI capabilities offer new opportunities. Generative models can suggest bioisosteric replacements for problematic functional groups, potentially improving metabolic stability or reducing toxicity.
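The following minimal RDKit sketch illustrates the underlying idea with a single hand-coded rule, the classic carboxylic acid-to-tetrazole swap; a generative model would learn such suggestions from data rather than applying fixed rules, and the molecule here is purely illustrative.

```python
# Minimal rule-based sketch of a bioisosteric replacement; generative
# models learn such suggestions from data rather than from fixed rules.
from rdkit import Chem
from rdkit.Chem import AllChem

# Classic medicinal-chemistry swap: carboxylic acid -> tetrazole, often
# used to retain acidity while improving metabolic stability.
mol = Chem.MolFromSmiles("OC(=O)Cc1ccccc1")   # phenylacetic acid
acid = Chem.MolFromSmarts("C(=O)[OX2H1]")     # carboxylic acid pattern
tetrazole = Chem.MolFromSmiles("c1nnn[nH]1")  # replacement fragment

products = AllChem.ReplaceSubstructs(mol, acid, tetrazole)
analog = products[0]
Chem.SanitizeMol(analog)                      # clean up the edited graph
print(Chem.MolToSmiles(analog))               # the suggested analog
```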
Protein structure prediction tools like AlphaFold2, ESMFold, Boltz-2, and their successors enable in silico assessments. Integrating these capabilities within the same environment that manages experimental data creates a seamless workflow from hypothesis generation through experimental validation.

Figure 4. Bioisostere suggestion within the same environment as the entity registry connected to experimental data.

Figure 5. Examples of Boltz-2 data seamlessly connected within the same environment where wet lab data is stored.

Collaborative drug discovery

Drug discovery increasingly operates through partnerships: academic collaborations, CRO relationships, consortium efforts, startups working with big pharmaceutical companies, and precompetitive initiatives such as neglected tropical disease and rare disease collaborations.

Effective collaboration requires granular access controls, along with collaborative informatics for data partitioning, provenance, and read-only versus read-and-write user permissions. Who owns data generated in a collaboration? How long can each party retain it? Who can use it to train AI models? Informatics infrastructure must be flexible enough to implement whatever agreements emerge. This means role-based permissions, project-specific access rules, and audit trail capabilities that demonstrate compliance with data sharing agreements—granting partners visibility into relevant data while protecting proprietary information under clear data governance policies.

As AI evolves, the demand for high-quality training data will only increase. Organizations that invest in data infrastructure during the current wave of AI enthusiasm will find themselves well-positioned for whatever computational advances emerge next. Those that focus solely on algorithms while neglecting the data foundation will be seen as lacking fundamental foresight and vision.

The lesson is not that AI is overhyped. Realizing its potential requires equal attention to the less glamorous but more important work of data management. In drug discovery, as in architecture, foundations determine what structures can be built above them.

Barry Bunin, PhD, is CEO ([email protected]) and director of the board; Alex Clark, PhD, is a research scientist; Peter Gedeck, PhD, is a cheminformatician and data scientist; and Janice Darlington is a scientist and mentor, all at Collaborative Drug Discovery (CDD).