News, Mar 31, 2026, via GEN

Common Ancestry Limits Protein Sequence Exploration, Computational Study Shows


Credit: Christoph Burgsted/Science Photo Library/Getty Images

For all the excitement surrounding AlphaFold and the surge of AI‑driven protein design tools, one fact often goes unexamined: nearly every model is trained on databases of known proteins. These databases feel vast, but compared to the astronomical number of sequences that could, in principle, form a functional protein, they represent only a sliver of what’s possible. That raises a fundamental question for the field: how representative is the protein universe we currently know?

A new study published in PNAS, titled “Descent from a common ancestor restricts exploration of protein sequence space,” takes direct aim at that question, using large‑scale protein evolution modeling to probe the limits of protein diversification. The work, led by researchers at the Okinawa Institute of Science and Technology (OIST), the Institute of Science and Technology Austria (ISTA), the University of Vienna, and the Centro de Astrobiología (CAB), suggests that the boundaries of today’s protein diversity were set remarkably early in life’s history.

“Modern AI methods are thought to be revolutionizing protein design,” said Fyodor Kondrashov, PhD, who heads OIST’s Evolutionary and Synthetic Biology Unit, in a press release. “Yet most of these AI design methods are typically trained on databases of known proteins. Without understanding how representative these known proteins are of sequence space, how confident can we be that such methods can generate truly diverse protein designs?”

Figure: An abstract, not-to-scale visual representation of different protein sequence spaces for a particular biological function. A large box represents all possible amino acid sequences (approximately 20^L combinations, where L is the chain length). A smaller blue patch represents the sequences that create functional proteins, and an even smaller green area shows the sequences that we have historically confirmed to exist. Gold-colored lines show protein evolution pathways, and red lines mark extinct pathways (i.e., those that aren’t thought to be present in modern biodiversity). [Isakova et al.]

To investigate that representativeness, the team began by mathematically describing the region of sequence space occupied by known proteins. They estimated the “dimensionality” of each protein family, a measure of how many independent directions evolution has actually sampled, using correlation‑based analyses of sequence variation. The researchers then simulated protein evolution to test how ancestry, selection, and epistasis shape the diversity of sequences that evolution can reach. The goal was to estimate how many functional sequences should exist for a given protein family and to compare that theoretical diversity with the diversity observed in nature.

What emerged was a striking pattern: the strongest constraint on protein evolution is not selection or epistasis, but ancestry itself. Proteins tend to remain clustered near the sequences of their earliest ancestors, with limited divergence into the broader functional landscape. As the authors report, “For some gene families the effective topological dimension was on the order of one; in just a few families it was larger than 10.”

“That [the] starting point is the main evolutionary limit is not necessarily surprising, but the scale of its importance is really quite remarkable,” said lead author Lada Isakova, a PhD student within the unit. “As an evolutionary biologist, I was intrigued to see how little selection and epistasis seemed to matter in our results.”

The findings also feed into long‑standing debates about how the first proteins emerged. The team’s simulations suggest that early protein families could not have arisen simply by mutating a single first sequence. Instead, the data point toward DNA recombination “to create new DNA molecules which could encode very different proteins,” explained Isakova.

For today’s protein engineers, the work underscores a practical limitation: AI models trained on existing proteins may struggle to extrapolate into unexplored regions of sequence space. “Most methods won’t be able to generalize well beyond the current known sequence space,” Isakova noted. “We can see there are huge swaths of sequence space left to be explored, but it’ll take new experimental data to enable expansion into these unknown realms.”
