A Julia approach to Machine Learning
Writing machine learning code by hand is:
Time consuming.
Full of boilerplate.
Often inefficient.
This is why packages are useful!
Flux is a Julia package.
It is hackable, extensible, and provides nice syntactic sugar.
It relies on Zygote, which is an automatic differentiation package.
Revisit primary concepts: automatic differentiation, optimisation, loss functions.
Grammar of Flux.
Basic Flux Usage.
Constructing Neural Networks.
Loss functions define the objective of the network.
They are constructed to measure the difference between the model's predictions and the data.
Common loss functions are mse, logitcrossentropy, binarycrossentropy, etc.
A loss function may be optimised in any way we like.
For finding global minima, random sampling is (in the worst case) as efficient as any other method.
Finding local minima (or simply reducing the loss) is most often done through gradient descent.
Gradient descent is the backbone of machine learning.
Newton's method and gradient descent are derivative-based optimisation methods.
Derivatives can be computed symbolically, numerically, or automatically: \[ \frac{df}{dx} = \lim_{h\rightarrow 0} \frac{f(x+h) - f(x)}{h} \]
Define a dual number as \(d(x) = (x, \epsilon)\), a value paired with an infinitesimal part satisfying \(\epsilon^2 = 0\).
By the Taylor series, \(f(x + \epsilon) = f(x) + \epsilon f'(x)\), so the derivative is read off as the coefficient of \(\epsilon\).
Julia: operator-overload all the usual functions to handle dual numbers.
Differentiation is then exact to machine precision and cheaper than numerical differencing.
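A minimal sketch of forward-mode differentiation with a hand-rolled dual number; the Dual type and its overloads below are illustrative only, not Zygote's implementation.
# Illustrative dual-number type: value plus coefficient of ε.
struct Dual
    val::Float64   # x
    eps::Float64   # coefficient of ε, i.e. the derivative carried along
end
# Overload the usual operations so ε propagates via the chain rule.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.eps + b.eps)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.val * b.eps + a.eps * b.val)
Base.tanh(d::Dual) = Dual(tanh(d.val), (1 - tanh(d.val)^2) * d.eps)
# Differentiate f(x) = tanh(x * x): seed ε = 1 for the variable of interest.
x = Dual(0.5, 1.0)
y = tanh(x * x)
y.eps   # ≈ 2x * (1 - tanh(x^2)^2) evaluated at x = 0.5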
Backpropagation is the central algorithm of machine learning.
Errors are back-propagated through the layers of the network.
Under the hood: chain rule. \[\frac{\partial F}{\partial x} = \frac{\partial f_1}{\partial f_2}\frac{\partial f_2}{\partial f_3}\ldots\frac{\partial f_n}{\partial x}\]
There are many automatic differentiation packages in Julia: Flux uses Zygote.
The API call is gradient, and it is used in two ways: functions and do blocks.
Either way, the parameters to be differentiated must be specified via a Flux.params conversion.
using Flux
f(x) = sum(tanh.(W * x))
W = rand(2,2)
x = rand(2,1)
# a single instance
dfdw = gradient(()->f(x), Flux.params(W))
# as a do block
g = gradient(Flux.params(W)) do
f(x)
end
Grads(...)
Gradient descent is defined as: \[W_{t+1} = W_t - \eta \frac{\partial L}{\partial W} \]
Gradient descent may be accelerated by momentum.
Flux has support for multiple optimisers through optimiser objects constructed as opt(lr, hyperparams):
Descent(lr), Momentum(lr, ρ), ADAM()
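A sketch of one explicit optimiser step, reusing the f, W, and x from the gradient example above; the learning rate and values are illustrative.
using Flux
W = rand(2, 2)
x = rand(2, 1)
f(x) = sum(tanh.(W * x))
opt = Descent(0.1)                  # or Momentum(0.1, 0.9), ADAM()
ps  = Flux.params(W)
gs  = gradient(() -> f(x), ps)      # compute ∂f/∂W
Flux.Optimise.update!(opt, ps, gs)  # W ← W - η ∂f/∂W (for Descent)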
(Figure: gradient descent comparison.)
Gradients can be approximated using only a subset of the data.
Stochastic gradient descent is when these subsets are chosen randomly.
This helps avoid local minima and alleviates memory constraints.
The activation function mimics the firing rate of a neuron.
It provides a non-linearity to allow for complex learning manifolds.
Flux supports all the common functions: tanh, relu, σ, etc.
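A quick sketch of how activations are used; the values and layer sizes are illustrative.
using Flux
x = randn(Float32, 3)
tanh.(x)              # activations are applied element-wise by broadcasting
relu.(x)
σ.(x)                 # the sigmoid function, exported by Flux as σ
Dense(3 => 2, relu)   # more commonly, the activation is handed to a layer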
Neural networks are composed as layers of weights.
Flux offers default layers which can be found in the documentation: Dense, Conv, MaxPool, RNNCell, etc.
We can also define our own custom layers as a structure.
Zygote differentiates through structure fields.
struct CustomDense
    W
    b
    function CustomDense(dimsin::Int, dimsout::Int)
        W = rand(dimsout, dimsin)
        b = rand(dimsout)
        new(W, b)
    end
end
# overload object definition
(obj::CustomDense)(x) = obj.W * x .+ obj.b
dense = CustomDense(5, 2)
dense(rand(5))
2-element Vector{Float64}:
0.38609011196942344
1.7445256031830105
The dense layer represents the all-to-all connections.
The API is Dense(dimsin => dimsout, activation).
The activation is an optional positional argument; bias=true/false is an optional keyword.
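A quick sketch of the Dense API; the sizes and inputs are illustrative.
using Flux
layer = Dense(10 => 3, relu; bias = true)   # all-to-all: 10 inputs, 3 outputs
layer(rand(Float32, 10))                    # a single point -> 3-element output
layer(rand(Float32, 10, 32))                # a batch of 32 columns also works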
The convolutional layer is specified with Conv((kerneldim1, kerneldim2), featuresin => featuresout, activation).
It accepts the keywords stride=(s1,s2), pad=(p1,p2), dilation=(d1,d2), and bias; the activation is again a positional argument.
The layer accepts a 4D tensor: x dimension, y dimension, feature dimension, and batch index.
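A sketch of the expected input shape; the kernel size, channel counts, and image size are illustrative.
using Flux
conv = Conv((3, 3), 3 => 16, relu; stride = (1, 1), pad = (1, 1))
x = rand(Float32, 28, 28, 3, 8)   # x dim × y dim × features × batch
size(conv(x))                     # (28, 28, 16, 8) with this padding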
flatten layers take an input and flatten it to a vector per batch element.
Dropout layers randomly zero a fraction of activations during training.
BatchNorm layers normalise the mean and variance of each feature over a batch.
AlphaDropout is a dropout variant that also preserves the input mean and variance (for use with selu).
There are extensive pooling layers, with MaxPool the most common.
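A small sketch of dropout behaviour in training versus test mode; the dropout rate and input are illustrative.
using Flux
d = Dropout(0.5)
x = ones(Float32, 10)
Flux.trainmode!(d)   # training: roughly half the activations are zeroed (the rest rescaled)
d(x)
Flux.testmode!(d)    # testing: dropout acts as the identity
d(x) == x            # true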
Composing a model is simply a chain of layers.
Flux allows this to be done intuitively with Chain: model = Chain(Dense(10=>3), Dense(3=>5))
A generically complex model can be given by a callable struct f.
Flux can be told that this struct is a model, with trainable fields, via @functor f.
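A sketch of both patterns; the struct name MyModel and the layer sizes are illustrative.
using Flux
# A Chain is itself callable.
model = Chain(Dense(10 => 3, relu), Dense(3 => 5))
model(rand(Float32, 10))
# A custom callable struct becomes a model once Flux knows about its fields.
struct MyModel
    encoder
    decoder
end
Flux.@functor MyModel                  # expose the fields to params, gpu, etc.
(m::MyModel)(x) = m.decoder(m.encoder(x))
m = MyModel(Dense(10 => 4, relu), Dense(4 => 10))
length(Flux.params(m))                 # 4 parameter arrays (two weights, two biases)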
Data is thought of “naturally” in Flux.
A model operates on a single point, or a batch of points.
A convenience function, DataLoader, is provided.
It zips data pairs together and can optionally shuffle them, iterating with a given batch size.
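A sketch of DataLoader usage; the data arrays and batch size are illustrative.
using Flux
X = rand(Float32, 4, 100)   # 100 observations stored along the last dimension
Y = rand(Float32, 1, 100)
loader = Flux.DataLoader((X, Y); batchsize = 10, shuffle = true)
for (x, y) in loader
    @show size(x), size(y)  # (4, 10) and (1, 10) for each batch
end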
Loss functions are defined on individual data pairs.
Flux then aggregates them over a batch, which is stored along the last dimension.
The aggregation is the mean by default; it can be changed with the keyword agg.
There are many default loss functions: mse, logitcrossentropy, huber_loss, etc.
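A sketch of the agg keyword; the prediction and target arrays are illustrative.
using Flux
ŷ = rand(Float32, 3, 5)             # predictions for a batch of 5
y = rand(Float32, 3, 5)             # targets
Flux.Losses.mse(ŷ, y)               # aggregated with the mean (the default agg)
Flux.Losses.mse(ŷ, y; agg = sum)    # sum over the batch instead
Flux.Losses.huber_loss(ŷ, y)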
We have a model, data, optimisers, and loss functions, and can now update parameters.
update! takes an optimiser, the parameters, and the computed gradients, and updates the parameters in place:
update!(opt, Flux.params(w...), grad)
This can be applied through a loop over the data where the gradient is repeatedly computed.
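A sketch of such a loop under assumed toy data; the model, data, and optimiser choices are illustrative.
using Flux
X, Y   = rand(Float32, 4, 100), rand(Float32, 1, 100)   # hypothetical data
loader = Flux.DataLoader((X, Y); batchsize = 10, shuffle = true)
model = Dense(4 => 1)
opt   = ADAM()
ps    = Flux.params(model)
loss(x, y) = Flux.Losses.mse(model(x), y)
for (x, y) in loader
    gs = gradient(() -> loss(x, y), ps)   # gradients for every tracked parameter
    Flux.Optimise.update!(opt, ps, gs)    # one optimiser step, in place
end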
An even easier API to work with is train!.
It accepts a loss function, parameters, a data generator, and an optimiser:
train!(loss, Flux.params(w...), data, opt)
A callback function lets us run code (e.g. a printout) during training under certain conditions.
Callbacks are passed as optional keywords to train!, e.g. train!(...; cb = ()->println("This calls back")).
By default the callback runs on every batch, but it can be slowed down with cb = throttle(some_func, seconds), which fires at most once per given number of seconds.
Callbacks can be used with Flux.stop() to abort training if a condition is met, e.g. an accuracy goal, a timing limit, etc.
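A sketch of train! with a throttled callback, assuming the same toy model, loss, and loader as in the update! sketch above.
using Flux
using Flux: throttle
X, Y   = rand(Float32, 4, 100), rand(Float32, 1, 100)    # hypothetical data
loader = Flux.DataLoader((X, Y); batchsize = 10)
model  = Dense(4 => 1)
loss(x, y) = Flux.Losses.mse(model(x), y)
evalcb() = @show loss(X, Y)                              # print the current loss
Flux.train!(loss, Flux.params(model), loader, ADAM();
            cb = throttle(evalcb, 10))                   # at most once every 10 s
# To abort on a condition, e.g. a loss target:
# cb = () -> loss(X, Y) < 0.01 && Flux.stop()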
A model can be piped into a GPU/CPU device with the gpu and cpu APIs.
For a GPU-based model the data should also be converted to CUDA arrays. Only CUDA (NVIDIA) GPUs are supported.
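A sketch of moving a model and its data to the GPU; the model and input sizes are illustrative, and gpu is a no-op if no CUDA device is available.
using Flux, CUDA
model = Chain(Dense(10 => 5, relu), Dense(5 => 2)) |> gpu   # parameters to the GPU
x = rand(Float32, 10, 32) |> gpu                            # data must be a CuArray too
y = model(x)                                                # runs on the GPU
y |> cpu                                                    # bring the result back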
To save a model we need the BSON package.
A model is saved with @save "dir" model_obj
A model is loaded with @load "dir" model_obj
It is recommended to pipe models to a CPU device before saving.
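A sketch of saving and reloading; the file name "mymodel.bson" and the model are illustrative.
using Flux, BSON
using BSON: @save, @load
model = Chain(Dense(10 => 5, relu), Dense(5 => 2)) |> cpu   # move to CPU before saving
@save "mymodel.bson" model
@load "mymodel.bson" model    # restores the variable `model`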
Within a training loop you can be expressive.
It is common to plot the loss on the training and validation sets.
It is common to plot a data visualisation (particularly for generative models).
You can also report training time: elapsed and estimated finish.
using Flux, JLD2
data = JLD2.load("./data/sharks/sharkdata.jld")
training_data, training_labels = [data["train_data"], data["train_labels"]]
model = Chain(
    Flux.Conv((3,3), 3=>10, relu),
    Flux.MaxPool((2,2)),
    Flux.Conv((5,5), 10 => 5, relu),
    Flux.MaxPool((2,2)),
    Flux.Conv((10,10), 5 => 7, relu),
    Flux.flatten,
    Dense(112, length(unique(training_labels)), σ))
Chain(
  Conv((3, 3), 3 => 10, relu),          # 280 parameters
  MaxPool((2, 2)),
  Conv((5, 5), 10 => 5, relu),          # 1_255 parameters
  MaxPool((2, 2)),
  Conv((10, 10), 5 => 7, relu),         # 3_507 parameters
  Flux.flatten,
  Dense(112 => 3, σ),                   # 339 parameters
)                   # Total: 8 arrays, 5_381 parameters, 22.449 KiB.
@time for e in 1:100 # epochs
    d = Flux.DataLoader((training_data, Flux.onehotbatch(training_labels, unique(training_labels))), shuffle=true, batchsize=5)
    Flux.train!((x,y) -> Flux.logitcrossentropy(Flux.softmax(model(x)), y), Flux.params(model), d, ADAM())
end
predictions = map(i -> argmax(model(data["test_data"])[:,i]), 1:size(model(data["test_data"]))[2])
labels = ["thresher", "nurse", "basking"]
pred_labels = [labels[predictions[i]] for i in 1:length(predictions)]
@show acc = sum(pred_labels .== data["test_labels"])/length(pred_labels);
30.337100 seconds (39.47 M allocations: 10.089 GiB, 7.34% gc time, 56.53% compilation time)
acc = sum(pred_labels .== data["test_labels"]) / length(pred_labels) = 0.8
Machine learning is, at its core, a set of sophisticated methods for optimising an objective/loss function.
Flux allows us to abstract away the boilerplate of machine learning code.
It is fast and flexible, and lets us define models simply with a chain-and-layer structure.
We can focus on the conceptual difficulties, rather than the technical challenges.