“Protein archaeology”: Understanding how proteins evolve

17 December 2017

Since their emergence about 3.7 billion years ago, proteins have been key to life as we know it. They come in different shapes and are involved in various biological functions, including metabolism, immune response, muscle movement, and more. But how did proteins emerge and continue to evolve? Their functions are varied and complex, but they are all made of the same building blocks: amino acids connected in long chains that fold to create three-dimensional structures.

Tracing proteins through “remains”

Through the process of evolution, amino acid segments from existing proteins are grafted and stitched together to form new proteins, sometimes with new uses and functions. We conclude this from the traces this process leaves behind, in the form of reused segments. By tracing these recurring segments, researchers are able to decipher the origin of proteins, much as archaeologists trace human history.

Previous attempts to detect segment reuse focused on ‘domains’, segments comprising roughly 100 amino acids. Domains carry practical importance: researchers use data on one domain to draw inferences about evolutionarily related domains in other proteins. For example, a study might use mutations in a protein domain in the fruit fly to make inferences about the related domains in humans.

In practice, researchers rely on popular databases that segment proteins into domains based on reuse. While these databases often disagree on how to segment a given protein into domains, two important decisions are common to all of them: (a) each amino acid is part of exactly one domain, and (b) domains are approximately 100 amino acids long.

These decisions were made when the domain databases were initiated in the early 1990s, at a time when data were very limited. Nonetheless, they have had a far-reaching effect, because studies of proteins are interpreted in light of what is known about those proteins’ domains. In recent years, the Protein Data Bank has grown vastly, leading researchers to believe that the domain-centric approach should be revisited.

A new approach to proteins

A new study by Dr. Rachel Kolodny of the Department of Computer Science at the University of Haifa, and Dr. Sergey Nepomnyachiy and Prof. Nir Ben-Tal of the Department of Biochemistry and Molecular Biology at Tel Aviv University's George S. Wise Faculty of Life Sciences, has done just that. The researchers systematically examined reuse patterns of segments of diverse lengths among proteins. To do this, they developed efficient algorithmic tools and a computational pipeline, and applied them to a representative set of over 20,000 proteins.

Their analysis revealed reused segments ranging in length from 35 to 200 amino acids, which they called ‘themes’. The authors showed that reuse is highly prevalent, and that the extent of reuse increases as the minimum length threshold for a ‘theme’ is lowered.
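The relationship between the length threshold and the extent of reuse can be illustrated with a minimal Python sketch. It is not the researchers' pipeline: the `Theme` record, the protein IDs, and all numbers are made up for illustration. It only shows why lowering the minimum theme length can never reduce the number of proteins found to share a theme.

```python
from typing import NamedTuple


class Theme(NamedTuple):
    length: int          # number of amino acids in the reused segment
    proteins: frozenset  # hypothetical IDs of the proteins that contain it


# Made-up toy data, not taken from the study.
THEMES = [
    Theme(180, frozenset({"P1", "P2"})),
    Theme(100, frozenset({"P1", "P2", "P3"})),
    Theme(60, frozenset({"P1", "P2", "P3", "P4"})),
    Theme(35, frozenset({"P1", "P2", "P3", "P4", "P5", "P6"})),
]


def proteins_sharing_a_theme(themes, min_length):
    """Return the proteins that share at least one theme of length >= min_length."""
    shared = set()
    for theme in themes:
        # A segment counts as reused only if it occurs in more than one protein.
        if theme.length >= min_length and len(theme.proteins) > 1:
            shared |= theme.proteins
    return shared


# Lowering the threshold admits more themes, so the count can only grow.
for threshold in (200, 150, 100, 50, 35):
    print(threshold, len(proteins_sharing_a_theme(THEMES, threshold)))
```

In this toy set, no protein shares a theme at a threshold of 200 amino acids, while all six do at a threshold of 35.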

Interestingly, reuse of themes of 100 amino acids does not stand out, refuting the assumption that domains are the ‘atomic’ evolutionary units. Rather, complicated evolutionary traces emerge. For example, the researchers observed recursive patterns akin to Russian nesting dolls, in which long themes that appear in a few proteins encompass shorter themes, which appear in larger numbers of proteins.

In other words, the same amino acid belongs both to a short theme, shared by many proteins, and to longer themes shared by fewer proteins. These results highlight the wealth of evolutionary information that becomes available when relaxing the (arbitrary) constraints imposed by the current domain databases.
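The nesting pattern can be pictured as intervals along a single protein chain. The Python sketch below is only an illustration under stated assumptions: `ThemeHit`, its fields, and the numbers are hypothetical and do not come from the study. It shows one residue covered by several themes at once, something a one-domain-per-amino-acid scheme cannot represent.

```python
from typing import NamedTuple


class ThemeHit(NamedTuple):
    start: int       # first residue covered (1-based, inclusive)
    end: int         # last residue covered (inclusive)
    n_proteins: int  # number of proteins in which the theme recurs


# Hypothetical theme hits on one protein; values are invented.
HITS = [
    ThemeHit(10, 190, 3),
    ThemeHit(40, 140, 25),
    ThemeHit(60, 100, 400),
]


def themes_covering(hits, residue):
    """Return the hits that contain the given residue, longest first."""
    covering = [h for h in hits if h.start <= residue <= h.end]
    return sorted(covering, key=lambda h: h.end - h.start, reverse=True)


def is_nested(hits):
    """True if each hit lies inside the previous one and recurs in more proteins."""
    return all(
        outer.start <= inner.start
        and inner.end <= outer.end
        and inner.n_proteins > outer.n_proteins
        for outer, inner in zip(hits, hits[1:])
    )


covering = themes_covering(HITS, 80)
print(len(covering), is_nested(covering))
```

Here residue 80 sits inside all three themes, with the shortest one shared by the most proteins, whereas residue 20 belongs only to the longest theme.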

The new study has raised considerable interest, including the following tweet from the Savage lab (Berkeley): “New look at reuse of sequences over protein evolution from Ben Tal and Kolodny! Time to rewrite textbooks?”

The researchers are currently working on identifying connections between themes and protein function, such as DNA/RNA binding, involvement in disease, and drug binding.

They are also examining the use of themes for protein design, to better engineer proteins for industrial use. Because the ‘themes’ have the proven ability to ‘hop between proteins’, they are natural candidates for building blocks. Grafting them may allow new functions to be introduced into proteins.

With such designs, current industrial pipelines, for example those used to manufacture computer chips, could perhaps be replaced by ones based on engineered proteins. For example, certain proteins (mostly in plants and bacteria) have the capability to convert sunlight into more useful forms of energy. With proper design, these could perhaps power the chip-production process.