How Words are Like Genetic Parts

August 13, 2025 · James McInerney · 18 min read

The paradox of selection without phenotypic determinism

The modern synthesis established that, although random genetic drift plays an important role in evolutionary history, particularly in small populations, natural selection is the primary force shaping genomic content across most of life. We see this in the extraordinary conservation of core metabolic genes across billions of years of evolution, in the rapid fixation of beneficial mutations in experimental populations, and in the clear signatures of selection across every genome we sequence. This dominance of selection over drift might seem to imply that genes must have fixed, deterministic phenotypic effects. After all, how could selection consistently favour certain genetic variants if their effects were not reliable and predictable?

Yet this leap from “selected” to “phenotypically deterministic” represents a profound logical fallacy that has constrained biological understanding for decades. Selection can powerfully shape genomic content without creating deterministic phenotypic outcomes because what selection evaluates is not fixed effects but the appropriateness of molecular functions within specific genomic and environmental contexts. If we take an analogy from behavioural evolution, we can see that natural selection has clearly shaped the neural mechanisms underlying fear responses in prey animals, but this doesn’t mean every rabbit freezes at every shadow. Instead, selection has tuned the deployment of fear responses given contextual cues — movement patterns, sounds, odours, time of day. The underlying neural mechanisms are conserved and reliable, yet their phenotypic manifestation is fundamentally context dependent.

The same principle applies at the molecular level. Genes typically encode stable molecular functions. These functions are carried out by proteins with specific catalytic activities, binding specificities, and structural properties. Nonetheless, the phenotypic effects of these functions depend entirely on when, where, and how much they are expressed, and how they integrate into existing cellular networks. A gene encoding glucokinase will reliably produce a protein that catalyses glucose phosphorylation regardless of genomic context. However, whether this enzymatic activity enhances survival, has no effect, or even proves detrimental depends on the metabolic context, regulatory environment, and ecological niche of the organism.

The language-learning paradigm

The field of linguistics underwent a conceptual revolution that illuminates our path forward in genomics, but perhaps not in the way typically described. Traditional approaches to understanding language did indeed focus on fixed meanings and rule sets, but the more profound insight from modern linguistics concerns how meaning is actually acquired and deployed. Children do not learn language by memorising dictionary definitions. Instead, they infer word meanings from repeated exposure to contextual usage. A child learns what “hot” means not from being given a definition (“having a high temperature”) but from observing its use: “the stove is hot”, “hot soup”, “it’s hot outside”, paired with experiences of temperature, parental warnings, and observed reactions.

This learning process resembles natural selection far more than rational design. Children don’t consciously formulate hypotheses about word meanings and test them systematically. Instead, they absorb patterns of usage, gradually refining their understanding through countless contextual exposures. The “meaning” of a word emerges from this experiential process, not from explicit instruction. Similarly, evolution doesn’t evaluate genes based on predetermined functional definitions. Genes are incorporated into genomes through processes like horizontal gene transfer (HGT), mutation, and recombination, and their fit is evaluated through the filter of natural selection acting on phenotypic outcomes in specific contexts.

The analogy above reveals an important distinction often missed in genomic thinking. While some words like “zebra” have relatively stable referents across contexts, their deployment and significance still depend heavily on context. “Zebra” in a wildlife documentary means something different from “zebra” in urban planning (zebra crossing), though both derive from the same core reference to the striped animal. Similarly, while a gene may encode a protein with stable molecular properties, its significance to the organism varies dramatically across cellular states, developmental stages, and environmental conditions. The molecular function provides the stable “referent”, but the phenotypic meaning emerges from contextual deployment. This perspective has implications for how we think about both evolutionary and synthetic biology.

Beyond fixed gene-phenotype mappings

The success of modern language models adds a further point: complex, appropriate responses can emerge from learned probability distributions over contextual relationships, without requiring explicit storage of meanings or rules. Large language models don’t contain lookup tables of word definitions or grammatical templates, and they do not store sentences, books or libraries. Instead, they learn statistical patterns of how words co-occur, and where they fall relative to one another, across very large collections of text. This enables them to generate contextually appropriate text by sampling from learned probability distributions. The model doesn’t “know” what words mean in any explicit sense, nor does it try to learn definitions from a dictionary. Instead, the model has learned to predict likely continuations given contextual patterns.
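To make "sampling from learned probability distributions" concrete, here is a deliberately tiny sketch. It is a bigram counter, not a neural network, so it is an assumption-laden stand-in for how real language models work: the corpus, the function names, and the one-word context are all invented for illustration. The point is only that no definition of "hot" is stored anywhere; there are just counts of what follows what.

```python
import random
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, context):
    """Turn raw follow-counts for one context word into probabilities."""
    followers = counts[context]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}


def sample_next(counts, context, rng):
    """Sample one continuation from the learned distribution."""
    dist = next_word_distribution(counts, context)
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]


# Toy corpus (invented), echoing the "hot" example from earlier in the essay
corpus = [
    "the stove is hot",
    "the soup is hot",
    "the soup is cold",
    "it is hot outside",
]

counts = train_bigram(corpus)
dist_after_is = next_word_distribution(counts, "is")
sampled = sample_next(counts, "is", random.Random(0))
```

Here `dist_after_is` assigns probability 0.75 to "hot" and 0.25 to "cold", purely from usage. Change the corpus (the context) and the same word acquires a different distribution of continuations.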

This computational insight provides a powerful framework for understanding genomes. Traditional genetics has treated genomes as if they were instruction manuals containing explicit gene-phenotype mappings — gene A produces protein B which performs function C, leading to phenotype D. But genomes, like language models, don’t encode specific outcomes; they encode the molecular components and regulatory relationships that, given developmental and environmental context, generate probability distributions over possible phenotypes. A genome doesn’t contain a predetermined organism any more than a language model contains predetermined sentences. Instead, the genome contains the genes and regulatory elements that, through their expression and interaction patterns, generate probability distributions over possible cellular states and phenotypic outcomes.

This reconceptualisation immediately explains numerous puzzles in modern genomics that remain paradoxical under deterministic gene-phenotype models. Why do knockout screens give different results in different cell lines? Because the phenotypic effect of losing a molecular function depends on the cellular context: the presence of compensatory pathways, the metabolic demands of the cell type, the regulatory state. Why do synthetic genetic circuits that work perfectly in one organism fail when transferred to another? Because the same molecular components encounter different regulatory environments, metabolic states, and competing cellular processes. Why can we still not reliably predict phenotype from genotype despite having millions of sequenced genomes? Because phenotypes emerge from the contextual deployment of molecular functions, not from the functions themselves. Mutation adds a further layer, changing the “words” themselves over time.

The essentiality paradox: same function, different phenotypic consequences

Perhaps nowhere is the context-dependent nature of gene effects more starkly illustrated than in studies of gene essentiality. The conventional view holds that some genes are “essential” (required for life) while others are “dispensable”. This binary classification underlies much of modern genetics, from the design of knockout screens to the identification of drug targets. Yet systematic studies comparing gene essentiality across different organisms reveal that essentiality is not an intrinsic property of molecular functions but an emergent property of how those functions integrate into specific genomic and cellular contexts. That only a few dozen gene families are found universally across cellular life (Ciccarelli et al.) is testimony to how plastic essentiality is.

Recent large-scale studies demonstrate that genes showing lethal knockout phenotypes in one genomic background can be completely dispensable in closely related organisms. Crucially, this context-dependency extends to genes whose molecular functions remain identical across species. Cell division genes like ftsL and divIC that are dispensable in Streptomyces coelicolor are essential for viability in bacteria like Bacillus subtilis. Genes encoding components of DNA polymerase III show varying degrees of essentiality across bacterial genomes, not because their molecular functions change, but because of differences in redundancy, backup systems, and metabolic organisation.

DNA glycosylase enzymes provide compelling evidence for the distinction between molecular function and phenotypic effect. These enzymes maintain consistent molecular functions — recognising and removing specific types of damaged DNA bases through base-flipping mechanisms — across diverse organisms. However, their phenotypic essentiality varies dramatically depending on genomic context. Mouse knockout studies reveal that “the phenotype of DNA glycosylase disruptions in mice is usually rather moderate”, with most single glycosylase knockouts being “viable and fertile with only moderately increased mutation frequencies and no overt early disease phenotype”. This mild phenotype “has been attributed to the existence of back-up activities and/or alternative pathways for the removal of oxidised DNA bases”, including overlapping substrate specificities among different glycosylases and compensation by alternative repair mechanisms like transcription-coupled repair. The only exception is thymine DNA glycosylase (TDG), which “was recently reported to be essential for embryonic development in mouse”, likely due to its unique role in epigenetic regulation rather than just DNA repair. Thus, while the molecular function of base excision remains constant, the phenotypic consequences of losing these functions depend entirely on the genomic context, including the presence of backup systems, alternative pathways, and competing cellular processes.
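The logic of backup activities can be sketched in a few lines. This is a toy model under loud assumptions: a genome is reduced to a mapping from gene name to a single molecular function, the gene and function names are hypothetical, and "essential" simply means "the only provider of a required function". Real essentiality is far messier, but the sketch shows how the same gene with the same function can flip between essential and dispensable.

```python
def is_essential(gene, genome, required_functions):
    """Return True if knocking out `gene` removes the only provider
    of a function the cell requires (toy definition of essentiality)."""
    function = genome[gene]
    if function not in required_functions:
        return False
    backups = [
        other for other, other_function in genome.items()
        if other != gene and other_function == function
    ]
    return len(backups) == 0


# Functions this hypothetical cell cannot live without
required = {"remove_damaged_base", "replicate_dna"}

# Two hypothetical genomic backgrounds carrying the same glycosylase gene
genome_without_backup = {
    "glycosylase_1": "remove_damaged_base",
    "replicase": "replicate_dna",
}
genome_with_backup = {
    "glycosylase_1": "remove_damaged_base",
    "glycosylase_2": "remove_damaged_base",  # overlapping activity
    "replicase": "replicate_dna",
}

essential_alone = is_essential("glycosylase_1", genome_without_backup, required)
essential_with_backup = is_essential("glycosylase_1", genome_with_backup, required)
```

The molecular function attached to `glycosylase_1` is identical in both genomes; only the surrounding gene content differs, and with it the answer.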

This context-dependency of phenotypic effects has profound implications for practical applications. In antibiotic development, we target genes encoding essential molecular functions, assuming their inhibition will reliably kill pathogens. Yet the context-dependency of essentiality means that the same molecular target may be essential in some strains or conditions but not others, contributing to the evolution of resistance through compensatory mutations that restore function through alternative pathways.

The regulatory context: same function, variable deployment

The distinction between molecular function and phenotypic effect becomes clearest when examining gene regulation. A gene encoding a perfectly functional enzyme may have dramatically different phenotypic effects depending on its regulatory context — when it’s expressed, where it’s expressed, and how much is produced. The molecular function remains constant, but its deployment varies, leading to context-dependent phenotypic outcomes.

Alternative promoters provide a clear example of this principle. The same coding sequence can be driven by different regulatory elements in different organisms or cell types, leading to distinct expression patterns and phenotypic effects. A metabolic enzyme that’s constitutively expressed at high levels might be essential for growth, while the same enzyme under stress-responsive regulation might be dispensable under normal conditions but crucial for survival under specific environmental challenges. The molecular function — the enzymatic activity — doesn’t change, but its contextual deployment determines its phenotypic significance.

Gene dosage effects further illustrate this separation of molecular function from phenotypic consequence. Many genes show dose-dependent phenotypic effects, where moderate overexpression enhances fitness, higher levels become neutral, and extreme overexpression proves toxic. The molecular function of the encoded protein doesn’t change across this dosage range, but the phenotypic consequences vary dramatically based on expression levels and cellular context. This explains why the same gene can be beneficial in one genomic background (where its expression is appropriately tuned) but detrimental in another (where regulatory mechanisms produce inappropriate expression levels).
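A dose-response curve of this shape is easy to sketch. The functional form below (a saturating benefit minus a quadratic cost) and every parameter value are illustrative assumptions rather than measurements, and the background names are hypothetical; the only claim is qualitative.

```python
def relative_fitness(expression, max_benefit=0.5, half_saturation=1.0,
                     toxicity=0.01):
    """Toy dose-response curve: saturating benefit minus a rising cost.

    Fitness is relative to a cell that does not express the gene (1.0).
    All parameters are illustrative assumptions.
    """
    benefit = max_benefit * expression / (half_saturation + expression)
    cost = toxicity * expression ** 2
    return 1.0 + benefit - cost


# Hypothetical expression levels set by different regulatory backgrounds
backgrounds = {
    "not_expressed": 0.0,
    "well_tuned": 2.0,
    "mis_regulated": 10.0,
}

fitness_by_background = {
    name: relative_fitness(level) for name, level in backgrounds.items()
}
```

With these numbers the gene is beneficial in the well-tuned background and costly in the mis-regulated one, even though `relative_fitness` (standing in for the unchanged molecular function) is the same function throughout.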

Horizontal gene transfer: testing molecular functions in new contexts

Horizontal gene transfer (HGT) provides perhaps the most direct test of the context-dependency principle. When genes move between organisms through HGT, their molecular functions typically remain intact — the same enzymatic activities, binding specificities, and structural properties. However, their phenotypic effects in the new genomic context can range from highly beneficial to severely detrimental, depending on how well the molecular function integrates into the existing cellular systems.

This process mirrors how children learn word meanings through contextual exposure rather than explicit definition. When a gene arrives in a new genome through HGT, evolution doesn’t “look up” its predetermined function and decide whether to keep it. Instead, the gene’s appropriateness is evaluated through its actual performance in the new context — does its expression enhance fitness, have no effect, or prove costly? The molecular function provides the raw material, but the selective value emerges from contextual integration.

Antibiotic resistance genes provide compelling examples of this principle. A β-lactamase gene encodes a protein with the same molecular function (hydrolysing β-lactam antibiotics) regardless of which organism carries it. However, the phenotypic effect of acquiring this gene varies dramatically across recipients. In organisms with appropriate regulatory machinery to express the gene when needed, the means to deliver the enzyme to where the antibiotic acts, and metabolic capacity to support the energetic costs, the gene provides strong resistance. In organisms lacking these contextual factors, the same molecular function may provide little benefit or even impose costs through inappropriate expression or metabolic burden.

Distinguishing function from effect

Moving beyond deterministic gene-phenotype thinking requires clearly distinguishing between molecular function and phenotypic effect. Molecular function refers to the stable biochemical or biophysical properties of gene products — enzymatic activities, binding specificities, structural roles, regulatory interactions. These functions can often be characterised in isolation and remain relatively consistent across contexts. Phenotypic effect refers to the observable consequences of expressing, overexpressing, or eliminating these molecular functions in specific cellular, organismal, and environmental contexts.

This distinction immediately resolves many apparent paradoxes in genomics. When we observe that the same gene has different effects in different organisms, we’re not seeing context-dependent molecular functions but context-dependent deployment and integration of stable molecular functions. The glucose-phosphorylating activity of a hexokinase enzyme remains constant whether the protein is expressed in yeast or human cells. However, the phenotypic consequences of overexpressing or deleting this enzymatic activity depend on the metabolic organisation, regulatory networks, and environmental demands of each organism.

Understanding genes as encoding molecular functions that are deployed in context-dependent ways provides a more accurate framework for predicting and engineering biological systems. Instead of asking “What does this gene do?” — which conflates function with effect — we should ask “What molecular function does this gene encode?” and “How is this function deployed and integrated in different contexts?” The first question has increasingly clear answers as we understand protein structures and mechanisms. The second requires understanding the complex regulatory and metabolic networks that determine when, where, and how molecular functions are expressed and utilised.

The grammar of genomic context

If molecular functions are like stable word referents, then genomic context provides the grammatical and semantic framework that determines how these functions are combined and deployed to produce phenotypic “meanings”. This genomic grammar operates at multiple scales, from local regulatory elements that control individual gene expression to global network properties that determine how molecular functions integrate into cellular processes.

Regulatory architecture provides the immediate syntactic context for molecular functions. Promoters, enhancers, and silencers determine when and where genes are expressed, while post-transcriptional mechanisms control protein abundance and activity. The same coding sequence — the same molecular function — can produce entirely different phenotypic effects depending on its regulatory environment. Like words that change meaning based on syntactic context (consider “run” in “run the program” versus “run to the store”), genes produce different phenotypic effects based on their regulatory deployment.

Metabolic and regulatory networks provide the broader semantic context within which molecular functions gain phenotypic significance. An enzyme’s molecular function only becomes phenotypically meaningful within the context of metabolic pathways, substrate availability, and competing reactions. Similarly, a transcription factor’s molecular function — its ability to bind specific DNA sequences — only produces phenotypic effects within the context of the genes it regulates and the cellular conditions that determine its activity. The network context determines whether a molecular function enhances fitness, has no effect, or proves detrimental.

Evolution as context-dependent evaluation

Understanding evolution as evaluating molecular functions within genomic contexts rather than selecting for predetermined effects resolves many puzzles in evolutionary biology. When horizontal gene transfer introduces a new gene into a genome, evolution doesn’t consult a catalogue of gene functions to decide whether to retain it. Instead, the gene’s retention depends on its actual performance in the new context — how well its molecular function integrates into existing networks, whether its expression proves beneficial or costly, and how it affects overall organismal fitness.

This context-dependent evaluation explains the maintenance of genetic variation in populations. The same molecular function can be beneficial in some genomic backgrounds or environments but neutral or deleterious in others. Rather than selection fixing a single optimal variant, we see the maintenance of multiple variants whose relative fitness depends on context. This process generates the raw material for adaptation to changing environments — not through creating new molecular functions de novo, but through deploying existing functions in new contexts or combinations.
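One way to see how context-dependent fitness can, under some conditions, maintain variation is a small deterministic sketch. The model below is a Levene-style soft-selection model for two haploid variants, which is our choice of illustration and not something the argument above depends on; the niche shares and fitness values are invented.

```python
def next_frequency(p, niches):
    """One generation of selection on variant A (frequency p).

    niches: list of (share, fitness_a, fitness_b) tuples whose shares
    sum to 1. Each niche contributes a fixed share of the next
    generation, and the variants compete only within their own niche.
    """
    q = 1.0 - p
    return sum(
        share * (p * w_a) / (p * w_a + q * w_b)
        for share, w_a, w_b in niches
    )


def iterate(p, niches, generations):
    """Apply next_frequency repeatedly and return the final frequency."""
    for _ in range(generations):
        p = next_frequency(p, niches)
    return p


# Variant A is favoured in one context and disfavoured in the other
two_contexts = [(0.5, 2.0, 1.0), (0.5, 0.5, 1.0)]
# Variant A is favoured everywhere
one_context = [(1.0, 2.0, 1.0)]

from_rare = iterate(0.01, two_contexts, 500)
from_common = iterate(0.99, two_contexts, 500)
single_context_outcome = iterate(0.01, one_context, 500)
```

With two contexts, variant A settles at an intermediate frequency whether it starts rare or common; with a single context it sweeps towards fixation. Other parameter choices, or other models of how contexts are encountered, need not behave this way, so this illustrates the possibility and does not prove the general claim.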

The evolution of gene regulation becomes particularly important in this framework. While molecular functions may be relatively conserved across related organisms, regulatory patterns show much more rapid evolution. This makes sense if evolution is primarily tuning the deployment of molecular functions rather than the functions themselves. Changes in gene expression, alternative splicing, and regulatory network architecture can dramatically alter phenotypic effects without modifying the underlying molecular functions. The evolution of complexity involves not just acquiring new molecular functions but developing increasingly sophisticated ways of deploying and integrating existing functions.

Why function-based predictions succeed while effect-based predictions fail

The distinction between molecular function and phenotypic effect explains the mixed success of genomic prediction efforts. When we focus on predicting molecular functions — what enzymatic activities a protein has, what DNA sequences a transcription factor binds, what structural role a protein plays — our predictions are increasingly successful. Advances in protein structure prediction, comparative genomics, and functional annotation have made it possible to predict molecular functions with reasonable accuracy from sequence information alone.

However, when we attempt to predict phenotypic effects (what happens when we overexpress this gene, whether knocking out this gene will be lethal, how this mutation will affect disease risk), our predictions remain frustratingly inaccurate. This is not simply a technical limitation that more sequence data or better algorithms will overcome on their own; it reflects the fundamental context-dependency of phenotypic effects. The same molecular function can have radically different phenotypic consequences depending on genomic background, regulatory context, and environmental conditions.

Consider drug development efforts that focus on molecular targets. When we identify an essential molecular function in a pathogen and design drugs to inhibit that function, we often achieve the desired biochemical effect — the drug successfully inhibits the target enzyme or blocks the intended interaction. However, the phenotypic effect — whether the drug actually kills the pathogen — depends on contextual factors: whether alternative pathways can compensate, whether the organism can upregulate the target, whether the drug reaches appropriate concentrations in relevant tissues. Success requires understanding not just molecular functions but how those functions are deployed and integrated in specific pathological contexts.

Toward context-aware genomic models

The path forward requires models that explicitly distinguish between molecular functions and their contextual deployment. Rather than trying to predict specific phenotypic outcomes from genotype alone, we need models that estimate how molecular functions will be deployed in specific contexts and what phenotypic effects might emerge from those deployment patterns. This requires integrating genomic information with data about regulatory states, metabolic conditions, and environmental factors.

Machine learning approaches offer promising directions, but they must be designed to capture context-dependency rather than assuming fixed gene-phenotype relationships. Models should be trained on data from multiple genomic backgrounds, cellular conditions, and environmental contexts to learn how the same molecular functions produce different effects in different situations. Rather than learning that “gene X causes phenotype Y”, these models should learn that “molecular function X, when deployed in context Z, tends to produce phenotypic effect Y with probability P”.
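As a minimal sketch of what "function X in context Z produces effect Y with probability P" could look like, the snippet below estimates conditional probabilities by counting. It is not a machine learning model in any serious sense, the observations are invented toy data, and the labels (`base_excision`, `backup_present`, and so on) are hypothetical; a real model would need far richer descriptions of context.

```python
from collections import Counter


def estimate_effect_probabilities(observations):
    """Estimate P(effect | function lost, context) from observed triples.

    observations: iterable of (function, context, effect) tuples.
    Returns a dict keyed by the same triples, with relative frequencies
    computed within each (function, context) pair.
    """
    joint = Counter()
    per_context = Counter()
    for function, context, effect in observations:
        joint[(function, context, effect)] += 1
        per_context[(function, context)] += 1
    return {
        key: count / per_context[key[:2]]
        for key, count in joint.items()
    }


# Invented knockout outcomes: same lost function, two contexts
observations = [
    ("base_excision", "backup_present", "viable"),
    ("base_excision", "backup_present", "viable"),
    ("base_excision", "backup_present", "viable"),
    ("base_excision", "backup_present", "lethal"),
    ("base_excision", "backup_absent", "lethal"),
    ("base_excision", "backup_absent", "lethal"),
]

probabilities = estimate_effect_probabilities(observations)
```

The output is a probability per context, not a single verdict per gene: in this toy data, losing the function is lethal a quarter of the time when a backup is present and every time when it is absent.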

This approach has immediate applications in personalised medicine, where the same drug targeting the same molecular function can have dramatically different effects in different patients. Rather than assuming that all patients with mutations in a particular gene will respond identically to a targeted therapy, context-aware models could incorporate information about genomic background, expression profiles, and environmental factors to predict likely responses for individual patients. The goal is not perfect prediction — which is impossible in context-dependent systems — but improved probability estimates that account for contextual variation.

Engineering for context-dependency

Synthetic biology has traditionally pursued the goal of creating biological systems with the predictability of engineered devices. The standard approach assumes that biological components have fixed functions that can be combined according to rational design principles. Yet synthetic biology consistently struggles with context-dependency — circuits that work in one organism fail in another, components that should be orthogonal interfere with each other, and systems that should be stable drift over time.

Rather than viewing context-dependency as a problem to be solved, future synthetic biology approaches should embrace it as a fundamental feature of biological systems. Instead of trying to create context-independent circuits, we could design systems that respond appropriately to cellular context. Instead of assuming components have fixed functions, we could design modular systems where the same molecular functions can be deployed in different ways depending on cellular state or environmental conditions.

This might involve creating libraries of regulatory variants that allow the same molecular functions to be deployed across different cellular contexts. It might mean designing circuits that sense cellular state and adjust their behaviour accordingly. It might require accepting that biological systems will always have some level of context-dependency and designing for robustness across contexts rather than precision in a single context. The future of synthetic biology lies not in eliminating biological complexity but in understanding and harnessing the principles by which molecular functions are contextually deployed.

Conclusion

The recognition that genes encode stable molecular functions whose phenotypic effects emerge from contextual deployment represents a crucial refinement of biological understanding. This perspective resolves the apparent paradox between strong natural selection and variable gene effects by recognising that selection acts on the appropriateness of molecular functions within specific contexts rather than on fixed gene-phenotype relationships.

Just as children learn word meanings through contextual exposure rather than dictionary memorisation, evolution evaluates genes through their performance in genomic contexts rather than through predetermined functional specifications. This process generates biological systems that are both robust and flexible — robust because they rely on well-tested molecular functions, flexible because these functions can be deployed in context-appropriate ways.

This framework explains why essential genes vary between organisms (same molecular functions, different cellular contexts), why synthetic circuits fail upon transfer (molecular functions encounter unfamiliar regulatory environments), and why genotype-phenotype predictions remain challenging (phenotypic effects emerge from contextual deployment, not molecular functions alone). These observations reflect the fundamental principles by which biological information is organised and deployed rather than technical limitations to be overcome.

The path forward requires experimental approaches that systematically explore how molecular functions are deployed across different contexts, theoretical frameworks that distinguish between function and effect, and practical applications that account for context-dependency. By understanding biological systems as context-dependent deployment of molecular functions rather than deterministic gene-phenotype mappings, we open new possibilities for prediction, engineering, and medicine that work with biological principles rather than against them.

The genome encodes not a fixed program but a flexible system for deploying molecular functions in response to context — much like how language enables flexible communication through the contextual deployment of stable semantic elements. This perspective reveals biology as even more sophisticated than deterministic models suggest. It is not a machine executing fixed instructions, but a system capable of generating appropriate responses to an unpredictable world through the intelligent deployment of molecular functions in context.