Transformers

Modern Sequence Analysis

Nicholas Gale and Stephen Eglen

Traditional Sequence Analysis

  • Sequence analysis has typically been performed by recurrent neural networks (RNNs).

  • Exploding/vanishing gradients from recursion.

  • Information decay leading to short memory.

  • O(n) sequential processing in sequence length: computation cannot be parallelised.

Some solutions.

  • LSTMs: forget gates to preserve information.

  • Gated Recurrent Units: similar gating to LSTMs but without an output gate.

  • Other proposals exist, but none avoid the sequential O(n) processing.

Transformers

  • “Attention is all you need” (2017)

  • Transformers are the latest development in large scale sequence analysis.

  • Address many problems with RNNs

  • Workhorse behind many “magical” applications e.g. voice assistants and language translation.

Attention: bare bones

  • Transformers leverage the idea of attention: not new.

  • Attention is computed between all elements in a sequence: a weighted relationship between locations.

  • All units are considered independently: massive parallelisation.

Attention

  • Simple attention relies on nothing more than a Euclidean projection: dot product.

Simple attention: computation

  • For each vector pair compute the dot product, then normalise each row with a softmax: \[ a_{ij} = x_i^T x_j \] \[ A_{ij} = \frac{\exp(a_{ij})}{\sum_k \exp(a_{ik})} \]

Attention Retrieval

  • An input is weighted by the row of the attention matrix corresponding to its index.

  • The output is the weighted sum of all vectors, weighted by this attention row: \[ y_i = \sum_j A_{ij} x_j \]
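
  • As a concrete illustration, a minimal Julia sketch of the two steps above, assuming the inputs are stacked as the rows of a matrix (all names here are illustrative):

# Simple dot-product attention for a toy sequence.
# xs is an L × D matrix whose rows are the input vectors x_i.
function simple_attention(xs)
    a = xs * xs'                            # a[i, j] = x_i . x_j, all pairwise dot products
    A = exp.(a) ./ sum(exp.(a), dims = 2)   # softmax over each row
    ys = A * xs                             # row i is y_i = sum_j A[i, j] x_j
    return A, ys
end

A, ys = simple_attention(randn(5, 8))       # 5 input vectors of dimension 8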

Queries, Keys, and Values

  • Imagine a batch of one-hot vectors: \[(0,0,1, \ldots, 0,0)\]

  • The product \(x_i^T x_j\) is one only when \(i = j\). Multiplying by the values then gives back \(x_i\).

  • This operation is acting like a look-up table.

  • Transformers inherit this database language and call these keys, queries, and values.
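
  • A toy illustration of this lookup behaviour (the sizes and stored values below are made up):

# One-hot vectors make dot-product attention act like a table lookup.
using LinearAlgebra

xs = Matrix{Float64}(I, 4, 4)       # four one-hot vectors stored as rows
values = [10.0, 20.0, 30.0, 40.0]   # a value associated with each index
q = xs[3, :]                        # query with the third one-hot vector
weights = xs * q                    # raw dot products: (0, 0, 1, 0)
retrieved = sum(weights .* values)  # 30.0, the value stored at index 3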

Attention Generalised

  • We would like the keys, queries, and values not to be fully determined by the raw Euclidean representation.

  • We treat them as linear transforms of the embedded vectors into a new space: \[k_i = W^Kx_i, \; q_i = W^Qx_i, \; v_i = W^Vx_i\]

  • This allows us to compress our key/query/value representation to a lower dimensionality.
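
  • A minimal sketch of these learned projections for a single embedded token (the sizes and weight names below are illustrative):

D, dk = 8, 4                        # embedding dimension and a smaller key/query/value dimension
Wk, Wq, Wv = randn(dk, D), randn(dk, D), randn(dk, D)   # learnable projection matrices
x = randn(D)                        # one embedded token
k, q, v = Wk * x, Wq * x, Wv * x    # k = W^K x, q = W^Q x, v = W^V x, each of length dk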

Transformers Bare Bones

  • The transformer model is composed of two attentional models: an encoder and decoder.

  • Input is in the form of vectors of tokens e.g. genome sequence.

  • The encoder transforms input to an encoded attentional sequence.

  • The decoder autoregressively transforms the encoder output into an output sequence.

  • The output sequence is decoded with a decoder dictionary e.g. amino acids.

Transformers.jl

  • Transformers.jl is built on top of Flux.

  • It has similar GPU support (and similar drawbacks).

  • Similar grammar but there are subtleties - documentation is helpful!

  • It is an open-source project: anyone can contribute, e.g. by adding to the documentation!

Transformer Codon Task

  • A codon is a group of three base nucleotides that encodes an amino acid.

  • Can we predict the amino acid sequence from the base sequence without giving the model any knowledge of codons?

using CSV, DataFrames, Transformers, Flux, Random, ProgressMeter

amino_codon = Dict( # the amino acid to codon relationship
    "A" => ["GCU", "GCC", "GCA", "GCG"],
    "R" => ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N" => ["AAU", "AAC"],
    "D" => ["GAU", "GAC"],
    "B" => ["AAU", "AAC", "GAU", "GAC"],
    "Q" => ["CAA", "CAG"],
    "E" => ["GAA", "GAG"],
    "Z" => ["CAA", "CAG", "GAA", "GAG"],
    "G" => ["GGU", "GGC", "GGA", "GGG"],
    "H" => ["CAU", "CAC"],
    "I" => ["AUU", "AUC", "AUA"],
    "L" => ["CUU", "CUC", "CUA", "CUG", "UUA", "UUG"],
    "K" => ["AAA", "AAG"],
    "M" => ["AUG"],
    "F" => ["UUU", "UUC"],
    "P" => ["CCU", "CCC", "CCA", "CCG"],
    "S" => ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
    "T" => ["ACU", "ACC", "ACA", "ACG"],
    "W" => ["UGG"],
    "Y" => ["UAU", "UAC"],
    "V" => ["GUU", "GUC", "GUA", "GUG"],
    #"1" => ["AUG"],
    #"9" => ["UAA", "UGA", "UAG"],
);

Pre-Processing

  • Token labels, a start/end symbol, and an unknown symbol are encoded into a lookup table (Vocabulary).

  • The input sequence tokens are first wrapped by “Start” and “End” tokens and encoded by the Vocabulary.

  • This encoded representation is semantically embedded into vectors of length \(D\).

pre_process(v) = cat("1", v..., "9"; dims=1)                  # wrap a sequence with start ("1") and end ("9") tokens
labels_y = cat("1", unique(train_y[1])..., "9", "0"; dims=1)  # token labels plus start/end and the unknown symbol "0"
labels_x = cat("1", unique(train_x[1])..., "9", "0"; dims=1)
tokenizer_x = Transformers.Vocabulary(labels_x, "0")          # lookup table (Vocabulary) mapping tokens to indices
tokenizer_y = Transformers.Vocabulary(labels_y, "0")
encoded_x = tokenizer_x.(pre_process.(train_x))               # train_x/train_y are the raw token sequences loaded earlier
encoded_y = tokenizer_y.(pre_process.(train_y))
embed_x = Transformers.Embed(d, length(tokenizer_x))          # semantic embedding into vectors of length d
embed_y = Transformers.Embed(d, length(tokenizer_y))

Positional Encoding

  • Attention lets us forget sequence order to parallelise computation.

  • Sequence order is important, e.g. a gene sequence cannot be scrambled.

  • A positional encoding function is used to inject this order into the learning.

  • Positional encodings can be learnt but typically use a fixed sinusoidal function (a hand-rolled sketch follows the code below): \[ p_{2k}(i) = \sin\left(i/10000^{2k/D}\right) \] \[ p_{2k+1}(i) = \cos\left(i/10000^{2k/D}\right) \]
position_embed = Transformers.PositionEmbedding(d)   # positional encoding of dimension d

function embeddingx(x)
  sem_em = embed_x(x, inv(sqrt(d)))       # scaled semantic embedding
  em = sem_em .+ position_embed(sem_em)   # add the positional encoding
  return em
end
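
  • For intuition, a hand-rolled sketch of the sinusoidal encoding above (PositionEmbedding provides this for you; the function below is illustrative and assumes an even \(D\)):

# Sinusoidal positional encoding: one length-D vector per position i.
function positional_encoding(i, D)
    p = zeros(Float32, D)
    for k in 0:(D ÷ 2 - 1)
        angle = i / 10000f0^(2k / D)
        p[2k + 1] = sin(angle)   # even component 2k (1-based index 2k + 1)
        p[2k + 2] = cos(angle)   # odd component 2k + 1 (1-based index 2k + 2)
    end
    return p
end

pe = hcat([positional_encoding(i, 16) for i in 1:10]...)   # 16 × 10: one column per position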

Keys, Queries, and Values

  • The keys, queries and values are re-embeddings of the information into a different representational subspace.

  • These are the key learnable parameters in the model.

  • An efficient subspace representation of information allows for impressive generalisability.

Attention Matrices

  • Each input vector forms a row and these can be concatenated for efficiency into a matrix: \(X\) (L x D)

  • K represents the keys, Q the queries, V the values.

Transformer Attention Head

  • The Transformer attention mechanism (scaled dot-product attention) is given by: \[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{D}}\right)V\] \[Q = XW^Q, \; K = XW^K, \; V = XW^V\]

  • Softmax is calculated row by row and the result multiplies \(V\) as an ordinary matrix product. The \(\sqrt{D}\) scaling keeps the dot products numerically stable.
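
  • A minimal sketch of a single attention head under this convention (\(X\) is \(L \times D\) with rows as token embeddings; here the scores are scaled by the projection dimension, and all names are illustrative):

# One scaled dot-product attention head.
function attention_head(X, Wq, Wk, Wv)
    Q, K, V = X * Wq, X * Wk, X * Wv                  # L × dk projections
    scores = Q * K' ./ sqrt(size(K, 2))               # L × L, scaled for stability
    A = exp.(scores) ./ sum(exp.(scores), dims = 2)   # row-wise softmax
    return A * V                                      # L × dk attention-weighted values
end

L, D, dk = 5, 8, 4
head = attention_head(randn(L, D), randn(D, dk), randn(D, dk), randn(D, dk))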

Multi-head attention

  • A transformer layer can have multiple heads, analogous to convolutional filters.

  • Each of these heads will learn to focus on different semantic relationships.

  • This can be efficiently encoded by simply concatenating each individual head. Usually, \(n \times h_d = D\).

  • The head dimension is then fixed by \(D\) and the number of heads, rather than being an independent hyper-parameter.

Residual and Normalisation

  • The outputs of the attention mechanism are multiplied by a matrix \(W^O\) to re-embed the vectors into length \(D\).

  • \(W^O\) is another learnable matrix.

  • The pre-attention inputs (residuals) are added to the re-embedded attention transformed inputs.

  • This combined vector is then layer-normalised.

Feed Forward Network

  • The normalised self-attention and residuals are passed through a feed-forward network: \[F(x) = W_2\,\text{ReLU}(W_1x + b_1) + b_2\]

Inner Dimensions

  • The inner-dimension is independent of the embedding dimension e.g. 2048.

  • The feed forward network is shared between all tokens.

  • The residuals are added to the output of the FFN and layer-normalised.
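
  • A sketch of this position-wise feed-forward step in Flux, including the residual connection and layer normalisation described above (the values of d and innerd below are illustrative):

using Flux

d, innerd = 64, 256                                     # embedding and inner dimensions
ffn = Chain(Dense(d, innerd, relu), Dense(innerd, d))   # the same network is shared by all positions
ln = LayerNorm(d)

# x is d × L (one column per position); ffn is applied to every column.
feed_forward_block(x) = ln(x .+ ffn(x))

y = feed_forward_block(randn(Float32, d, 20))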

Encoder Block/Layer

  • The operations of self-attention, feed forward networks, and layer-normalisation make an encoder block.

  • This can be thought of as a layer in a regular NN.

  • The Encoder is formed of several encoder blocks, e.g. 6.

Tencoder = Flux.Chain(
    embeddingx,                       # semantic + positional embedding
    Transformer(d, h, hd, innerd),    # encoder block: h attention heads of size hd, FFN inner dimension innerd
    Transformer(d, h, hd, innerd),
    Transformer(d, h, hd, innerd)
)

Decoder Block

  • A decoder block has a self-attention, an encoder-decoder-attention, and a feed-forward network.

  • The encoder-decoder-attention keys and values are constructed using output from encoder (and learnable matrices).

  • These keys and values are shared across all decoder blocks.

Dec1 = TransformerDecoder(d, h, hd, innerd)
Dec2 = TransformerDecoder(d, h, hd, innerd)
Dec3 = TransformerDecoder(d, h, hd, innerd)
ffn = Transformers.Positionwise(Dense(d, length(tokenizer_y)), softmax)

function Tdecoder((y, mx))
    emy = embeddingy(y)     # embeddingy is defined analogously to embeddingx
    d1 = Dec1(emy, mx)      # each decoder block attends over the encoder output mx
    d2 = Dec2(d1, mx)
    d3 = Dec3(d2, mx)
    ffn(d3)
end

Sequence Generation

  • Sequences are generated with a start symbol and terminated with a stop symbol.

  • The outputs are fed through the network as inputs until stopping.

  • The output length is capped at an arbitrary maximum, e.g. 1024.

  • Future positions are masked with -Inf so information cannot flow from later tokens to earlier ones: the decoder is autoregressive (a sketch of such a mask follows the code below).

function transcribe_protein(x)
    seq = [tokenizer_y("1")]                      # begin with the index of the start token "1"
    tok = ["1"]
    enc = Tencoder(x)                             # encode the input sequence once
    for i = 1:2*length(x)                         # cap the generated length
        dec = Tdecoder((seq, enc))                # predict the next-token distribution
        seqnext = argmax(vec(dec[1:end-1, end]))  # most probable token at the final position
        append!(seq, seqnext)
        toknext = Transformers.decode(tokenizer_y, seqnext)
        push!(tok, toknext)
        toknext == "9" && break                   # stop at the end token "9"
    end
    tok
end
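
  • For intuition, the -Inf mask mentioned above looks like this for a short sequence (a sketch only; in a full implementation the decoder's self-attention applies such a mask to its scores):

# Causal mask: position i may only attend to positions j <= i.
L = 5
mask = [j <= i ? 0.0 : -Inf for i in 1:L, j in 1:L]
# Added to the raw attention scores before the softmax, the -Inf entries send the
# weights on future positions to zero (exp(-Inf) == 0), keeping the decoder autoregressive.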

Loss Functions

  • The final step is a softmax over the outputs to generate a probability distribution over the dictionary/vocabulary.

  • The natural loss function is crossentropy.

  • Loss functions may be arbitrary.

Regularisation

  • The original paper used dropout for the layer parameters and label smoothing.

  • Dropout improves stability and convergence time.

  • Label smoothing increases model perplexity (the model is less certain) but improves accuracy.

function loss(xdata, ydata)
    L = 0
    for i in 1:length(ydata)
        # smoothed one-hot targets for the output tokens
        ytarget = Flux.label_smoothing(Flux.onehot(tokenizer_y, ydata[i]), 0.2f0)
        # teacher-forced prediction: the decoder sees the target sequence and the encoded input
        ypred = Tdecoder((ydata[i], Tencoder(xdata[i])))
        L += Flux.crossentropy(ypred, ytarget)   # Flux.crossentropy takes the prediction first
    end
    return L
end

Model Summary

  • The encoder takes positional and contextual inputs and the decoder autoregressively produces outputs.

  • The encoder and decoders use attention heads and a feed-forward network to perform the learning.

  • The attention mechanism transforms the embeddings into different subspaces through the key/query/value matrices.

  • Keys, Queries, Values are learned and represent optimal information relationships in the problem context.

Transfer Learning

  • Transformer models are large - amongst the largest neural networks.

  • Difficult to train without industrial resources.

  • Employ transfer learning: import pre-trained weights on a similar task and fine-tune to current task.

Transformers in Biology

  • Transformers are in relative infancy - lots of work to be done.

  • The obvious candidate is sequence analysis: genome and protein.

  • Some interesting developments: gene expression prediction from sequence with Enformer (DeepMind).

  • Protein Prediction tasks.

Transforming the Language of Life

  • Protein prediction can be done classically with HMMs and BLAST, but these are computationally expensive.

  • CNNs and RNNs are computationally more efficient but task dependent and don't generalise.

  • The authors propose the Transformer model PRoBERTa: a pre-trained, task-agnostic amino acid sequence representation.

  • Protein prediction tasks: binary protein-protein interaction (PPI) and protein family classification.

Problems

  • The authors achieved state-of-the-art performance on the target tasks (the specific tasks matter less than the approach).

  • The resulting model is vastly more computationally efficient: compare 128 GPUs for 4 days with 4 GPUs for 18 hours.

  • Still difficult to reproduce. GPUs are top of the line and few people have access to this many.

  • Transformers in general are large models and pose a reproducibility problem.

Summary

  • Transformers are a complex but powerful sequence analysis architecture.

  • Julia offers support through Transformers.jl.

  • Very successful, but computationally demanding and not easily reproducible.

  • Relatively unexplored in biology.

References

Attention is all you need; Vaswani et al. (2017)

Effective gene expression prediction from sequence by integrating long-range interactions; Avsec et al. (2021)

Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks; Nambiar et al. (2020)