Intro:
To model the language of life (i.e., the language of proteins) in a data-efficient and cost-optimized manner, we need a meaningful representation that encompasses its structural and functional information. Proteinea has developed, and is continuously improving, Ankh: a base protein language model built to change the landscape of protein engineering with AI.
Blog contents
- How’s Ankh Brought to Light?
- But What is a “Representation” or a “Language Model”?
- What’s Wrong with Existing Language Models?
- How’s Ankh Different?
- Why Call a Protein Language Model “Ankh”?
- How’s Ankh Built?
- How’s Ankh Tested?
- What Does This Mean for Protein Engineering?
- What’s Next?
How’s Ankh Brought to Light?
Ankh is supported and sponsored by Google via the Google Research Innovator and Google TPU Cloud Research Credit programs. Furthermore, the models trained in this work could not have been made easily and publicly available without the support of the HuggingFace team. Finally, the review and affirmation of the work's research objectives, as well as the verification of its results, were done in collaboration with the Technical University of Munich (TUM) and Columbia University.
But What is a “Representation” or a “Language Model”?
There is an unimaginably large number of 3D structures a protein sequence could potentially fold into. Hence, there exists a huge gap between the number of proteins whose primary structures (amino acid sequences) are available and those whose ground-truth 3D structures are available. Although several contributions now aim to bridge this gap by predicting the 3D structure of a protein sequence, modeling via these contributions is significantly more computationally expensive and remains, after all, a prediction. Hence, being able to extract information about how a protein folds, and accordingly functions, from its sequence alone is quite valuable!
We know computers process 1's and 0's. Therefore, to input a protein sequence into a digital model, it eventually needs to be in that format. But think about it: to represent every input token (i.e., unit) with just 1's and 0's, you would need a lot of them to cover all the unique input tokens you have. Not just that, such a representation would literally just feed the tokens in without interpreting what they mean or how they relate to each other. In the context of the protein language, we do not just want a representation that inputs a protein sequence; we want a representation that tells us something about that sequence. For this information to be valuable, it needs to denote insights into the protein's structure, since structure maps to function. This is the very core of what protein embeddings do.
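To make the contrast concrete, here is a minimal, purely illustrative Python sketch (not taken from Ankh's codebase) of the "just 1's and 0's" route: each amino acid becomes a one-hot vector that says nothing about what the residue means or how it relates to its neighbors.

```python
# A minimal sketch of the naive binary representation: one-hot vectors over the
# 20 standard amino acids. Every residue looks equally unrelated to every other.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence: str) -> np.ndarray:
    """Encode a protein sequence as a (length x 20) binary matrix."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for position, residue in enumerate(sequence):
        encoding[position, AA_TO_INDEX[residue]] = 1.0
    return encoding

print(one_hot("MKTAY").shape)  # (5, 20): valid input, but carries no "meaning"
```

Learned embeddings replace these sparse, meaning-free vectors with dense vectors whose geometry reflects what the model has learned about the sequence.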
Protein embeddings are vector representations of proteins that are proven to denote information not only about the meaning of individual amino acids but about the meaning of the sequence as a whole. When analyzed by a variety of credible contributions, this information is found to embody the protein's structure and function (i.e., when two proteins perform the same function, their embeddings are traced to be somewhat similar). This, in turn, revolutionizes the way we model proteins: we no longer view them merely as strings of amino acids but as high-dimensional vectors learned by large-scale neural networks trained on large-scale protein databases. These neural networks are often referred to as language models, referencing their objective of "modeling the language".
Protein language models are hence models whose learning objective is to model (i.e., reflect) the language of proteins. In a deep learning context, these models evolved from statistical and static methods into dynamic neural networks that produce representations that change depending on the context (i.e., the representation of a given amino acid changes depending on the rest of the amino acids). These representations are then utilized to reach higher modeling performance in a variety of protein downstream applications (e.g., predicting a specific protein attribute, generating a better protein with respect to a set of attributes, etc.). To learn these representations, language models are typically pre-trained on large-scale unsupervised datasets (datasets that consist of sequences without their annotations or labels), large and comprehensive enough to actually teach the model the designated language. Borrowing an analogy from natural languages: for an English language model to be able to, say, write poetry, it first needs to learn English. To do so, it needs a whole lot of English text for pre-training; it can then write poetry, perform grammar corrections, or do whatever English-driven task we aspire to achieve without a similarly massive amount of task-specific data. This concept is referred to as transfer learning: transferring the knowledge obtained in a general-purpose domain to a more specialized domain with potentially additional constraints.
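As an illustration of this transfer-learning workflow, the hedged sketch below loads a pre-trained protein language model and reuses it, frozen, to embed new sequences. The HuggingFace model identifier and the character-level tokenizer call are assumptions made for illustration; substitute the checkpoint and preprocessing you actually use.

```python
# A hedged sketch: reuse a frozen pre-trained encoder to embed protein sequences.
# MODEL_ID is an assumed HuggingFace identifier, not a guaranteed one.
import torch
from transformers import AutoTokenizer, T5EncoderModel

MODEL_ID = "ElnaggarLab/ankh-base"  # assumption for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = T5EncoderModel.from_pretrained(MODEL_ID).eval()

def embed(sequence: str) -> torch.Tensor:
    """Return one fixed-size vector per protein by mean-pooling residue embeddings."""
    inputs = tokenizer(list(sequence), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        residue_embeddings = encoder(**inputs).last_hidden_state  # (1, length, dim)
    return residue_embeddings.mean(dim=1).squeeze(0)

# Proteins with related functions tend to land close together in embedding space.
similarity = torch.cosine_similarity(embed("MKTAYIAKQR"), embed("MKTAYIVKQR"), dim=0)
print(float(similarity))
```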
What’s Wrong with Existing Language Models?
Protein language modeling has adopted the pre-trained language modeling paradigm: leverage general-purpose unsupervised sequences, then specialize to a more constrained range of downstream tasks, much like natural languages. In fact, the sequence-structure gap, and the sparsity of task-specialized, annotated data it causes, has promoted the direction of utilizing ever larger pre-training architectures. The assumed direct proportionality between a model's size and the richness of its learned representations, as well as its performance on downstream tasks, is encouraged, falsely, by observing that language models with massive numbers of parameters trained for massive numbers of steps still show a notable learning gradient (i.e., signs that they can potentially learn further). We can observe this trend by comparing the size of early protein language models (∼10⁶ parameters) to the most recently released ones (∼10⁹). This trend can be observed in a large number of protein language models, whose most recent state of the art is Meta AI's ESM-2 suite, which also showcases a performance-proportional increase in size from 650M, to 3B, to 15B parameters!
Despite the heavy emphasis on how scaling up protein language models corresponds to better performance, almost no emphasis is placed on the disadvantageous effects, embodied in the extensive computational cost. Furthermore, no emphasis is placed on the importance of distinguishing natural languages from protein languages in language model design and training. These deficiencies collectively raise the entry barrier for research innovation and confine it to scaling up.
How’s Ankh Different?
Ankh seeks the limits of performance and optimization rather than the limits of scale, prioritizing accessibility and enabling higher performance through data-efficient, cost-reduced, and knowledge-guided optimization. Ankh is the first general-purpose protein language model trained on Google's TPU-v4, surpassing state-of-the-art performance with less than 10% of the parameters for pre-training, 7% for inference, and 30% of the embedding dimension. This top performance is showcased on a representative range of structure and function benchmarks where Ankh excels.
Why Call a Protein Language Model “Ankh”?
We title our work "Ankh" (i.e. an Ancient Egyptian symbol denoting the key of life) in analogy to how our model "unlocks" the language of life via learning superior representations of its "letters", the amino acids.
How’s Ankh Built?
Ankh is built on the notion of using protein-specific computational experimentation to arrive at the best protein-specific model design while promoting accessibility to research innovation via attainable resources.
The designated experimentation varied a single independent variable at a time, spanning three categories: Masking, Architecture, and Dataset.
1. Masking is a pre-training objective that guides the model's learning of token representations by asking it to predict a random sample of input tokens, usually replaced by [MASK] placeholder(s). This class of experimentation investigated the impact of two masking-related parameters, Masking Strategy and Masking Probability (a small sketch of these two knobs follows this list)
- Masking Strategy indicates the means by which we decide which tokens to mask and which to keep unmasked. Motivated by the skewed distribution of amino acid tokens in protein sequences, in addition to the redundancy in the database, Ankh tested different masking strategies to ensure protein-specific adaptation
- Masking Probability indicates the ratio of tokens to be masked out of the entire sequence length. The default masking probability, borrowed from natural languages, is 15%. Ankh experimented with additional values in pursuit of protein specialization
2. Architecture: since Ankh utilizes an encoder-decoder transformer architecture, the architecture variations target the number of encoder and decoder layers, different combinations of depth and width, the non-linearity (i.e., activation function), and the means by which the model learns the order of tokens (i.e., positional embeddings)
a) Number of Encoder-Decoder Layers refers to altering the ratio between the two sub-networks' layers, given its trade-off impact on the meaningfulness of the embeddings, computational complexity, and downstream performance
b) Depth corresponds to the number of encoder and decoder layers, whereas Width corresponds to the embedding dimension, both in the transformer's context
c) Activation Function is the function introducing non-linearity to the feed-forward layers
d) Relative Positional Embeddings is a method of providing positional and indexing information about the sequence to the transformer architecture in a manner that is not constrained by pre-determined dimensions (a simplified sketch of this mechanism closes this section)
3. Dataset refers to the corpus of protein sequences used to pre-train the language model
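As referenced under Masking above, here is a minimal sketch of the two masking knobs. The exact strategies Ankh compared are detailed in the paper, so treat this as an illustration of the idea rather than the released pre-processing code.

```python
# An illustrative sketch (assumptions, not Ankh's pre-processing code) of the two
# masking knobs: which tokens to mask (strategy) and how many (probability).
import random

def mask_sequence(tokens, masking_probability=0.15, mask_token="<MASK>", rng=random):
    """Randomly replace a fraction of tokens; the model must reconstruct them."""
    masked, targets = [], []
    for token in tokens:
        if rng.random() < masking_probability:
            masked.append(mask_token)
            targets.append(token)      # the model is trained to predict these
        else:
            masked.append(token)
            targets.append(None)       # unmasked positions are not predicted
    return masked, targets

sequence = list("MKTAYIAKQRQISFVK")
print(mask_sequence(sequence, masking_probability=0.15))  # the NLP default
print(mask_sequence(sequence, masking_probability=0.30))  # a protein-specific value to test
```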
The aforementioned independent variables were placed in an iterative empirical framework, carrying the top-performing version of each set of experiments into the subsequent set. Furthermore, each variation was trained for two epochs while abiding by approximately the same total number of parameters per experiment to avoid computational bias.
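For the curious, item (d) above refers to mechanisms in the spirit of T5-style relative positional embeddings. The simplified sketch below shows how signed query-key offsets can be mapped to a small set of learned bias buckets; it is an assumption-laden simplification, not Ankh's exact configuration.

```python
# A simplified sketch of relative position bucketing: attention learns one bias per
# bucket of relative offsets instead of one per absolute position, so it is not tied
# to a pre-determined sequence length.
import torch

def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map signed key-minus-query offsets to small bucket ids."""
    num_buckets //= 2
    bucket = (relative_position > 0).long() * num_buckets   # separate buckets per direction
    position = relative_position.abs()
    max_exact = num_buckets // 2
    is_small = position < max_exact                          # small offsets keep exact buckets
    log_bucket = max_exact + (
        torch.log(position.float().clamp(min=1) / max_exact)
        / torch.log(torch.tensor(max_distance / max_exact))
        * (num_buckets - max_exact)
    ).long()                                                  # larger offsets grow logarithmically
    log_bucket = torch.minimum(log_bucket, torch.tensor(num_buckets - 1))
    return bucket + torch.where(is_small, position, log_bucket)

offsets = torch.arange(-4, 5)   # key position minus query position
print(relative_position_bucket(offsets))
```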
How’s Ankh Tested?
To assert the dominance of Ankh, principal benchmarking tasks falling into three groups, Protein Function Prediction, Protein Structure Prediction, and Protein Localization Prediction, were utilized. Ankh further provides a Protein Generation Analysis on High-N and One-N scales (N here refers to the number of input sequences available for the model to learn from), where it succeeds in learning evolutionary conservation-mutation trends (i.e., which sequence portions are evolutionarily preferred and hence should be conserved, and which are not evolutionarily frequent and can be mutated) and in introducing diversity while retaining key structural-functional characteristics (i.e., increasing the functional range of the protein without affecting its key original functionalities).
- Protein Function Prediction aims to evaluate the ability of protein embeddings to capture the functional scores of three critical design parameters of protein engineering: fitness (i.e., the extent of a desired functionality with respect to different combinations of amino acid mutations), solubility (i.e., whether or not a given protein is soluble, given its relevance to therapeutics and diagnostics), and fluorescence (i.e., the fluorescence intensity of green fluorescent protein mutants, given its usage in tracking the existence of proteins in cell lines and living organisms). A minimal sketch of this frozen-embedding evaluation setup follows these three task descriptions
- Protein Structure Prediction aims to evaluate the ability of the sequence-based embeddings of a protein to encompass accurate information about its structure. The large majority of a protein's biological parameters can be inferred from its structure, yet a sequence of amino acids can theoretically fold into a large number of possible structures. Consequently, Ankh addresses three critical prediction tasks in this set: secondary structure (i.e., predicting what shape every amino acid residue in a given sequence coils into, due to the significant functional information it holds), fold (i.e., predicting which fold, among a large number of possible folds, a specific sequence adopts, due to its crucial influence on function), and contact prediction (i.e., predicting which pairs of amino acid residues are in contact and hence affect each other as well as the entire protein)
- Protein Localization Prediction aims to evaluate the ability of protein embeddings to capture where a protein is expected to accumulate in the cell. This attribute is significant for understanding protein function, especially in disease target identification studies
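As referenced above, prediction benchmarks like these are typically run by freezing the language model and training only a small task head on its embeddings. The sketch below illustrates this with a toy solubility classifier; the embedding dimension and head architecture are assumptions for illustration, not the paper's exact downstream setup.

```python
# A hedged sketch of frozen-embedding evaluation: only the small task head is trained.
import torch
import torch.nn as nn

class SolubilityHead(nn.Module):
    def __init__(self, embedding_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),   # one logit: soluble vs. insoluble
        )

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.classifier(pooled_embedding)

head = SolubilityHead()
fake_embeddings = torch.randn(8, 768)          # stand-in for frozen protein embeddings
labels = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCEWithLogitsLoss()(head(fake_embeddings), labels)
loss.backward()                                # gradients flow only into the head
print(float(loss))
```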
Indeed, the detailed results found in the paper demonstrate that the Ankh suite outperforms, with no exceptions, the state of the art in each and every one of these tasks.
- Protein Generation Analysis aims to generate synthetic variants of natural proteins. This group of tasks carries crucial emphasis in terms of Ankh's applicability to protein engineering. Ankh is evaluated with two input datasets representing two different settings and scales in protein engineering: family-based (High-N) and single sequence-based (One-N) variant generation
Indeed, Ankh’s generated protein sequences show an impressive ability to learn evolutionary conservation-mutation trends, a trade-off that can be controlled by a user-input parameter referred to as temperature warping, so that the extent of the desired creativity can be chosen depending on the application.
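To unpack what temperature warping means in practice, the sketch below scales a model's output logits by a temperature before sampling; it is an illustrative snippet, not Ankh's generation code.

```python
# An illustrative sketch of temperature-scaled sampling: low temperatures stay
# conservative, high temperatures add diversity.
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample one token id from temperature-scaled logits."""
    probabilities = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probabilities, num_samples=1))

vocabulary_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])       # toy scores over 4 tokens
print(sample_next_token(vocabulary_logits, temperature=0.2))  # almost always token 0 (conservative)
print(sample_next_token(vocabulary_logits, temperature=1.5))  # more diverse choices (creative)
```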
Furthermore, the generated sequences span a wide range of sequence identity scores (i.e., the model does not just generate similar sequences) with observable functional diversity.
In fact, the generated sequences do not just provide functional diversity among themselves but also in comparison to the natural dataset (i.e., the generated sequences do not just retain the primary functional domains but expand to domains that were not even present in the natural dataset).
Astonishingly, the generated sequences introduce this diversity while retaining structural, and accordingly functional, identity from as little as a single starting sequence.
What Does This Mean for Protein Engineering?
The release of Ankh marks an unprecedented moment in the history of protein engineering using AI. Our results suggest that state-of-the-art performance can be reached and surpassed with significantly less computational power. This implies the need to question the often-highlighted correlation between model performance and required computational power, embodied in either model or data size. Instead, Ankh suggests viewing this correlation as a trade-off, highlighting the immense cost of directly scaling model/data size to improve performance. On the other side of the trade-off, Ankh proposes knowledge-guided means of optimization whose prerequisites revolve around the needed protein knowledge as well as going the extra mile in optimizing both the software and hardware components of the model life cycle. Capitalizing on these proposals will change both the performance of protein engineering tools and the accessibility of their research innovation as we know them.
What’s Next?
Ankh is proposed as an initial version of an optimized general-purpose protein language model. This version is meant to serve as a proof of concept for data-efficient, cost-reduced, and knowledge-guided means of optimization. It can be viewed as a pre-trained base model that we will specialize for high-impact and high-value protein modeling tasks in future work (e.g., full atomic resolution 3D structure prediction, protein generation, etc.) with task-specific optimization and detailed analysis.
Contact us at hello-ankh@proteinea.com for more information on Ankh and partnerships.