Sequence analysis has typically been performed by recurrent neural networks (RNNs), which come with several problems:
Exploding/vanishing gradients from recursion.
Information decay leading to short memory.
O(n) complexity in sequence length.
Some solutions.
LSTMs: forget gates to preserve information.
Gated Recurrent Units (GRUs): gating similar to the LSTM forget gate, but with no separate output gate or cell state.
Other proposals exist, but none is ideal: the O(n) sequential complexity remains.
Transformers
“Attention is all you need” (Vaswani et al., 2017)
Transformers are the latest development in large scale sequence analysis.
Address many problems with RNNs
Workhorse behind many “magical” applications, e.g. voice assistants and language translation.
Attention: bare bones
Transformers leverage the idea of attention: not new.
Attention is computed between all elements in a sequence: a weighted relationship between locations.
All units are considered independently: massive parallelisation.
Attention
Simple attention relies on nothing more than a projection in Euclidean space: the dot product.
Simple attention: computation
For each pair of vectors, compute the dot product between them and normalise each row with a softmax: \[ a_{ij} = x_i^T x_j \]\[ A_{ij} = \frac{\exp(a_{ij})}{\sum_j \exp(a_{ij})} \]
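A minimal sketch of this computation in plain Julia, using toy data (all names and sizes below are illustrative):

    using LinearAlgebra

    # Toy data: L = 3 input vectors, each of dimension D = 4.
    xs = [randn(4) for _ in 1:3]

    # Raw scores: a[i, j] = xᵢᵀ xⱼ
    a = [dot(xs[i], xs[j]) for i in 1:length(xs), j in 1:length(xs)]

    # Normalise each row with a softmax: A[i, j] = exp(a[i, j]) / Σⱼ exp(a[i, j])
    A = mapslices(r -> exp.(r) ./ sum(exp.(r)), a; dims = 2)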
Attention Retrieval
An input is weighted by the row of the attention matrix corresponding to its index.
The output is the sum of all vectors, weighted by this attention row: \[ y_i = \sum_j A_{ij} x_j \]
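Continuing the sketch above, the output for each position is the attention-weighted sum of all inputs:

    # Output: yᵢ = Σⱼ A[i, j] xⱼ, a weighted sum of all input vectors.
    ys = [sum(A[i, j] .* xs[j] for j in 1:length(xs)) for i in 1:length(xs)]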
Queries, Keys, and Values
Imagine the vectors are one-hot encoded: \[(0,0,1, \ldots, 0,0)\]
The product \(x_i^Tx_j\) is one only when \(i = j\) (the vectors match). Multiplying by the corresponding values then returns \(x_i\).
This operation is acting like a look-up table.
Transformers inherit the language and call these keys, queries, and values.
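A toy numerical illustration of this look-up behaviour (purely illustrative):

    # With one-hot vectors the raw score is 1 only where the two tokens match and 0 otherwise,
    # so the score matrix acts like a look-up table over the values.
    onehots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    scores = [sum(onehots[i] .* onehots[j]) for i in 1:3, j in 1:3]   # identity matrix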
Attention Generalised
We would like the keys, queries, and values not to be fully determined by the raw Euclidean representation.
We imagine them as linear transforms of the embedded input vectors into a new space: \[k_i = W^Kx_i, q_i = W^Qx_i, v_i = W^Vx_i\]
This allows us to compress our key/query/value representation to a lower dimensionality.
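A minimal sketch of these learned projections; the dimensions and names below are illustrative:

    D, d = 4, 2                                              # embedding dimension and compressed key/query/value dimension
    W_K, W_Q, W_V = randn(d, D), randn(d, D), randn(d, D)    # learnable projection matrices

    x = randn(D)                                             # an embedded input vector
    k, q, v = W_K * x, W_Q * x, W_V * x                      # key, query and value in the new (smaller) space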
Transformers Bare Bones
The transformer model is composed of two attentional models: an encoder and decoder.
Input is in the form of vectors of tokens e.g. genome sequence.
The encoder transforms input to an encoded attentional sequence.
The decoder autoregressively maps the encoder output to an output sequence.
The output sequence is decoded with a decoder dictionary e.g. amino acids.
Transformers.jl
Transformers.jl is built on top of Flux.
It has similar GPU support (and similar drawbacks).
Similar grammar but there are subtleties - documentation is helpful!
Open source contribution - documentation could be added to! Anyone can do this.
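A minimal usage sketch, assuming the Transformers.Basic interface shown in the package README at the time (constructor arguments: model size, number of heads, size per head, feed-forward size); the current API may differ, so check the documentation:

    using Transformers, Transformers.Basic          # built on Flux, so Flux layers compose with these

    encoder = Transformer(512, 8, 64, 2048)         # one encoder layer: 512-dim model, 8 heads of size 64, 2048 FFN
    decoder = TransformerDecoder(512, 8, 64, 2048)  # matching decoder layer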
Transformer Codon Task
A codon is a group of three nucleotide bases that encodes an amino acid.
Can we predict amino acids from base pairs without building in any knowledge of codons?
    # Embed the tokens (scaled by 1/sqrt(d)) and add position embeddings.
    function embeddingx(x)
        sem_em = embed_x(x, inv(sqrt(d)))
        em = sem_em .+ position_embed(sem_em)
        return em
    end
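For context, `d`, `embed_x` and `position_embed` are assumed to be defined elsewhere along the lines of the Transformers.jl tutorial; a hedged sketch of what they might look like (names and sizes illustrative, `vocab` assumed to exist):

    d = 512                                     # embedding dimension (illustrative)
    embed_x = Embed(d, length(vocab))           # token embedding layer from Transformers.Basic (assumed)
    position_embed = PositionEmbedding(d)       # sinusoidal position embedding (assumed)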
Keys, Queries, and Values
The keys, queries and values are re-embeddings of the information into a different representational subspace.
These are the key learnable parameters in the model.
An efficient subspace representation of information allows for impressive generalisability.
Attention Matrices
Each input vector forms a row; for efficiency these rows are concatenated into a matrix \(X\) (L × D).
K represents the keys, Q the queries, V the values.
Transformer Attention Head
The Transformer attention mechanism is given by: \[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{D}}\right)V\]\[K = XW^K, Q = XW^Q, V = XW^V\]
Softmax is calculated row by row; the normalised weights then multiply \(V\) as a matrix product. The scaling by \(\sqrt{D}\) is for numerical stability.
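A minimal sketch of a single attention head in plain Julia (row convention, \(X\) is L × D; all names illustrative):

    using LinearAlgebra

    # Row-wise softmax over the score matrix.
    row_softmax(S) = mapslices(r -> exp.(r) ./ sum(exp.(r)), S; dims = 2)

    function attention(X, W_K, W_Q, W_V)
        K, Q, V = X * W_K, X * W_Q, X * W_V    # (L × d) key, query and value matrices
        S = Q * K' ./ sqrt(size(K, 2))         # scaled dot-product scores (L × L)
        return row_softmax(S) * V              # attention-weighted sum of values (L × d)
    end

    L, D, d = 5, 8, 8
    X = randn(L, D)
    Y = attention(X, randn(D, d), randn(D, d), randn(D, d))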
Multi-head attention
A transformer layer can have multiple heads, analogous to convolutional filters.
Each of these heads will learn to focus on different semantic relationships.
This can be efficiently encoded by simply concatenating each individual head. Usually, \(n \times h_d = D\).
The head dimension is therefore not an independent hyper-parameter: it is determined by D and the number of heads.
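Continuing the sketch above, multi-head attention by concatenation (each head gets its own projections of size D ÷ n):

    function multihead(X, heads)
        # Each head has its own W_K, W_Q, W_V; the head outputs are concatenated feature-wise.
        outs = [attention(X, Wk, Wq, Wv) for (Wk, Wq, Wv) in heads]
        return reduce(hcat, outs)              # (L × D) when n_heads * head_dim == D
    end

    n, hd = 4, D ÷ 4                           # 4 heads, each of dimension D / 4
    heads = [(randn(D, hd), randn(D, hd), randn(D, hd)) for _ in 1:n]
    M = multihead(X, heads)                    # same shape as X: L × D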
Residual and Normalisation
The outputs of the attention mechanism are multiplied by a matrix \(W^O\) to re-embed the vectors into length D.
\(W^O\) is another learnable matrix.
The pre-attention inputs (residuals) are added to the re-embedded attention transformed inputs.
This combined vector is then layer-normalised.
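Continuing the sketch, the re-embedding, residual connection, and layer normalisation (a hand-rolled per-token layer norm is used here for clarity; Flux also provides a LayerNorm layer):

    using Statistics

    # Layer-normalise each row (token) of a matrix.
    layernorm(H) = mapslices(r -> (r .- mean(r)) ./ (std(r) + 1e-5), H; dims = 2)

    W_O = randn(D, D)                          # learnable re-embedding matrix
    Z = layernorm(X .+ M * W_O)                # residual connection followed by layer norm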
Feed Forward Network
The normalised self-attention and residuals are passed through a feed-forward network: \[F(x) = W_2\,\text{ReLU}(W_1x + b_1) + b_2\]
Inner Dimensions
The inner dimension is independent of the embedding dimension, e.g. 2048.
The feed forward network is shared between all tokens.
The residuals are added to the output of the FFN and layer-normalised.
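Continuing the sketch, the position-wise feed-forward step (a small inner dimension is used here; 2048 in the original paper), followed by its own residual connection and layer normalisation:

    d_ff = 32                                      # inner dimension (2048 in the original paper)
    W1, b1 = randn(d_ff, D), randn(d_ff)
    W2, b2 = randn(D, d_ff), randn(D)

    ffn(x) = W2 * max.(0.0, W1 * x .+ b1) .+ b2    # F(x) = W2 ReLU(W1 x + b1) + b2, shared across all tokens

    F = permutedims(reduce(hcat, [ffn(Z[i, :]) for i in 1:size(Z, 1)]))   # apply token by token, back to L × D
    out = layernorm(Z .+ F)                        # residual + layer norm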
Encoder Block/Layer
The operations of self-attention, feed forward networks, and layer-normalisation make an encoder block.
This can be thought of as a layer in a regular NN.
The Encoder is formed of several encoder blocks e.g. 6
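Putting the pieces together, a sketch of an encoder block as a function, with several blocks stacked to form the encoder (all names illustrative; libraries such as Transformers.jl package this up as layers with learnable parameters):

    # One encoder block: multi-head self-attention + residual/norm, then FFN + residual/norm.
    function encoder_block(X, heads, W_O, ffn)
        Z = layernorm(X .+ multihead(X, heads) * W_O)
        F = permutedims(reduce(hcat, [ffn(Z[i, :]) for i in 1:size(Z, 1)]))
        return layernorm(Z .+ F)
    end

    # The encoder applies several such blocks in sequence, e.g. 6.
    function encoder(X, blocks)
        for (heads, W_O, ffn) in blocks
            X = encoder_block(X, heads, W_O, ffn)
        end
        return X
    end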