In this post, we will test the technique developed in 2015 and described in the paper Cyclical Learning Rates for Training Neural Networks. It proposes a method for changing the learning rate on each iteration (or batch) in order to get the most out of the training process, reaching the maximum training accuracy with the minimum number of epochs and consequently saving time.
The method described is called “training with Cyclical Learning Rates”. The aim of this methodology is to train the neural network with a learning rate that changes in a cyclic way on each step or mini-batch, instead of a non-cyclic learning rate that is either constant or decays on every epoch. A minimal sketch of such a schedule is shown below.
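To make the idea concrete, the following sketch builds a simple triangular schedule: the learning rate ramps linearly from a lower bound up to an upper bound and back, once every 2*step_size iterations. The values here are chosen only for illustration; the more general Cyclic_LR function used later in the post implements the same idea with more options.

# Minimal sketch of a triangular cyclic schedule (values chosen only for illustration).
lr_low    <- 1e-5    # assumed lower bound of the cycle
lr_high   <- 1e-3    # assumed upper bound of the cycle
step_size <- 2000    # iterations per half cycle
iter      <- 1:20000
cycle     <- floor(1 + iter/(2*step_size))
x         <- abs(iter/step_size - 2*cycle + 1)
l_rate_demo <- lr_low + (lr_high - lr_low) * pmax(0, 1 - x)
plot(l_rate_demo, type="l", xlab="iteration", ylab="learning rate")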
We will also see how to determine the “reasonable bounds” between which the cyclic learning rate will oscillate. This technique is also explained in this other post, and because the learning rate is perhaps the most important hyperparameter for training neural networks, setting the learning rate limits for each network configuration can be a critical step before starting to train.
We will use the keras package for R to train the model, with TensorFlow as the backend. We will train from scratch instead of using a pretrained model; I think this is the best option for this post in order to see how Cyclical Learning Rates perform. Also, because I want to focus on the cyclic learning rate methodology, we will not use data augmentation; this lets us concentrate on the code specific to this purpose instead of trying to achieve the maximum possible accuracy of the model.
It is highly recommended to run this example on a GPU. I ran it on AWS (Amazon Web Services) cloud computing using a p2.xlarge instance.
First of all, you need the keras library on your system; if you need to install it, see “Install Keras and the TensorFlow backend”.
library(zoo)
#--
library(keras)
use_backend("tensorflow")
# for using: k_clear_session()
Loading the dataset
For this example we will use the CIFAR10 small images dataset, which consists of 60000 colour images (50000 training images and 10000 test images) of 32×32 pixels. The images are classified into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.
We will load the CIFAR10 dataset into memory; because it is included in the keras package, we can load it easily using dataset_cifar10().
# Data Preparation --------------------------------------------------------

cifar10 <- dataset_cifar10()
Next, we need to split the dataset into train and test sets. The ‘x’ arrays (x_train, x_test) contain the images (three matrices per image, one for each RGB colour component: red, green and blue); we normalize the image data by dividing it by 255. The ‘y’ matrices (y_train, y_test) contain the labels of the data in a one-hot encoded structure.
# Feature scale RGB values in test and train inputs
x_train <- cifar10$train$x/255
x_test <- cifar10$test$x/255
y_train <- to_categorical(cifar10$train$y, num_classes = 10)
y_test <- to_categorical(cifar10$test$y, num_classes = 10)
# train dataset
dim(x_train)
## [1] 50000    32    32     3
dim(y_train)
## [1] 50000    10

# test dataset
dim(x_test)
## [1] 10000    32    32     3
dim(y_test)
## [1] 10000    10
Each ‘y’ label code of the dataset refers to an animal or a vehicle, according to the following order.
categ <- c("plane", "auto", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")
We set the batch size to 32 and the number of epochs to 75 as shown below.
# Parameters --------------------------------------------------------------
batch_size <- 32
epochs <- 75
The next code gives you a better idea of the dataset by plotting some of the pictures. We plot them using the imager package, which is based on CImg.
# imager is based on CImg, for install imager and other dependencies see: https://github.com/dahtah/imager
# in ubuntu needs to install Cairo:
# > sudo apt-get install libcairo2-dev
# > sudo apt-get install libxt-dev
library(imager)

IM <- list()
for(i in 1:(15*30)) IM[[i]] <- as.cimg(aperm(x_train[i,,,], c(2,1,3)), dim=c(32,32,3))

par(mfrow=c(15,30), mar=c(0,0,0.5,0))
for(i in 1:(15*30)){
  plot(IM[[i]], axes=FALSE)
  title(categ[which.max(y_train[i,])], cex.main=1.0)
}
Neural network configuration
Because we want to compare the performance of the same neural network under different learning rate configurations, we define below a function that builds an identical network each time it is called. Note the k_clear_session() and use_session_with_seed() calls, which give every new training run the same starting point and make the results reproducible.
create_model <- function(Learning_rate=0.001, decay=0){

  # Defining Model ----------------------------------------------------------
  # KERAS clear session
  k_clear_session()
  use_session_with_seed(1, disable_gpu=FALSE, disable_parallel_cpu=FALSE)  # seed: for reproducible research

  # Initialize sequential model
  model <- keras_model_sequential()

  model %>%
    # Start with 2 hidden 2D convolutional layers fed 32x32 pixel images
    layer_conv_2d(filter=32, kernel_size=c(3,3), padding="same",
                  input_shape=c(32,32,3), kernel_regularizer=regularizer_l2(1e-4)) %>%
    layer_activation("relu") %>%
    layer_batch_normalization() %>%
    ##--
    layer_conv_2d(filter=32, kernel_size=c(3,3), kernel_regularizer=regularizer_l2(1e-4)) %>%
    layer_activation("relu") %>%
    layer_batch_normalization() %>%

    # max pooling
    layer_max_pooling_2d(pool_size=c(2,2)) %>%
    layer_dropout(0.40) %>%

    # 2 additional hidden 2D convolutional layers
    layer_conv_2d(filter=64, kernel_size=c(3,3), padding="same", kernel_regularizer=regularizer_l2(1e-4)) %>%
    layer_activation("relu") %>%
    layer_batch_normalization() %>%
    ##--
    layer_conv_2d(filter=64, kernel_size=c(3,3), padding="same", kernel_regularizer=regularizer_l2(1e-4)) %>%
    layer_activation("relu") %>%
    layer_batch_normalization() %>%

    # max pooling
    layer_max_pooling_2d(pool_size=c(2,2)) %>%
    layer_dropout(0.40) %>%

    # 2 additional hidden 2D convolutional layers
    layer_conv_2d(filter=128, kernel_size=c(3,3), padding="same", kernel_regularizer=regularizer_l2(1e-4)) %>%
    layer_activation("relu") %>%
    layer_batch_normalization() %>%
    ##--
    layer_conv_2d(filter=128, kernel_size=c(3,3), padding="same", kernel_regularizer=regularizer_l2(1e-4)) %>%
    layer_activation("relu") %>%
    layer_batch_normalization() %>%

    # max pooling
    layer_max_pooling_2d(pool_size=c(2,2)) %>%
    layer_dropout(0.50) %>%

    # flatten the output into a 10 unit output layer
    layer_flatten() %>%
    layer_dense(10, kernel_initializer=initializer_glorot_normal(seed=1)) %>%
    layer_activation("softmax")

  ##------
  model %>% compile(
    loss = "categorical_crossentropy",
    optimizer = optimizer_rmsprop(lr=Learning_rate, decay=decay),  # default if NULL lr=0.001
    metrics = "accuracy"
  )

  return(model)
}
The network consists of three pairs of convolutional layers, each pair followed by max pooling and dropout, with a final dense layer and a softmax activation at the end.
Let’s summarise the structure:
create_model()

## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ===========================================================================
## conv2d_1 (Conv2D)                (None, 32, 32, 32)            896
## ___________________________________________________________________________
## activation_1 (Activation)        (None, 32, 32, 32)            0
## ___________________________________________________________________________
## batch_normalization_1 (BatchNorm (None, 32, 32, 32)            128
## ___________________________________________________________________________
## conv2d_2 (Conv2D)                (None, 30, 30, 32)            9248
## ___________________________________________________________________________
## activation_2 (Activation)        (None, 30, 30, 32)            0
## ___________________________________________________________________________
## batch_normalization_2 (BatchNorm (None, 30, 30, 32)            128
## ___________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D)   (None, 15, 15, 32)            0
## ___________________________________________________________________________
## dropout_1 (Dropout)              (None, 15, 15, 32)            0
## ___________________________________________________________________________
## conv2d_3 (Conv2D)                (None, 15, 15, 64)            18496
## ___________________________________________________________________________
## activation_3 (Activation)        (None, 15, 15, 64)            0
## ___________________________________________________________________________
## batch_normalization_3 (BatchNorm (None, 15, 15, 64)            256
## ___________________________________________________________________________
## conv2d_4 (Conv2D)                (None, 15, 15, 64)            36928
## ___________________________________________________________________________
## activation_4 (Activation)        (None, 15, 15, 64)            0
## ___________________________________________________________________________
## batch_normalization_4 (BatchNorm (None, 15, 15, 64)            256
## ___________________________________________________________________________
## max_pooling2d_2 (MaxPooling2D)   (None, 7, 7, 64)              0
## ___________________________________________________________________________
## dropout_2 (Dropout)              (None, 7, 7, 64)              0
## ___________________________________________________________________________
## conv2d_5 (Conv2D)                (None, 7, 7, 128)             73856
## ___________________________________________________________________________
## activation_5 (Activation)        (None, 7, 7, 128)             0
## ___________________________________________________________________________
## batch_normalization_5 (BatchNorm (None, 7, 7, 128)             512
## ___________________________________________________________________________
## conv2d_6 (Conv2D)                (None, 7, 7, 128)             147584
## ___________________________________________________________________________
## activation_6 (Activation)        (None, 7, 7, 128)             0
## ___________________________________________________________________________
## batch_normalization_6 (BatchNorm (None, 7, 7, 128)             512
## ___________________________________________________________________________
## max_pooling2d_3 (MaxPooling2D)   (None, 3, 3, 128)             0
## ___________________________________________________________________________
## dropout_3 (Dropout)              (None, 3, 3, 128)             0
## ___________________________________________________________________________
## flatten_1 (Flatten)              (None, 1152)                  0
## ___________________________________________________________________________
## dense_1 (Dense)                  (None, 10)                    11530
## ___________________________________________________________________________
## activation_7 (Activation)        (None, 10)                    0
## ===========================================================================
## Total params: 300,330
## Trainable params: 299,434
## Non-trainable params: 896
## ___________________________________________________________________________
Callback functions
For our cyclic learning rate example, we also need to set up some specific callback functions. Callbacks are functions that run during the training process to perform different tasks, such as saving the model weights after an epoch, changing hyperparameters, or writing log files. There are some predefined callbacks in keras, but you can also create your own custom callbacks, as illustrated below.
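For instance, keras ships with ready-made callbacks such as callback_model_checkpoint() and callback_csv_logger(). The snippet below only illustrates how they are created; the file names are hypothetical and these callbacks are not used in this post.

# Illustration only: two of keras' predefined callbacks (not used in this post).
# callback_model_checkpoint() saves the best weights seen so far to a file,
# callback_csv_logger() writes the per-epoch metrics to a CSV log file.
cb_checkpoint <- callback_model_checkpoint("cifar10_best.h5",   # hypothetical file name
                                           monitor = "val_loss",
                                           save_best_only = TRUE)
cb_csv_log    <- callback_csv_logger("training_log.csv")        # hypothetical file name
# They would be passed to fit() via: callbacks = list(cb_checkpoint, cb_csv_log)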
Callback for logging metrics on each iteration
In a normal training process the keras framework gives us the training metrics (accuracy, loss) only at the end of each epoch, so we need a function that captures them at each ‘iteration’ or ‘batch’. Remember that one epoch consists of n iterations, and each iteration uses a batch of m images; in our case, with 50000 training images and a batch size of 32, that is ceiling(50000/32) = 1563 iterations per epoch.
The next function is an R6 class that inherits from KerasCallback and logs the accuracy and loss into a LogMetrics object at the end of each batch. This is useful because it gives you finer control over how the cyclic learning rate behaves on each iteration.
LogMetrics <- R6::R6Class("LogMetrics",
  inherit = KerasCallback,
  public = list(
    loss = NULL,
    acc = NULL,
    on_batch_end = function(batch, logs=list()) {
      self$loss <- c(self$loss, logs[["loss"]])
      self$acc <- c(self$acc, logs[["acc"]])
    }
))
Callbacks for changing the learning rate on each iteration
We also need to define three more callback functions:
- callback_lr_init: sets the iteration counter to zero and clears the learning rate history lr_hist and the iteration history iter_hist
- callback_lr_set: sets the learning rate for each iteration according to the l_rate vector
- callback_lr_log: logs the learning rate value and the iteration number into the lr_hist and iter_hist objects
callback_lr_init <- function(logs){
  iter <<- 0
  lr_hist <<- c()
  iter_hist <<- c()
}
callback_lr_set <- function(batch, logs){
  iter <<- iter + 1
  LR <- l_rate[iter]
  # if number of iterations > l_rate values, make LR constant to last value
  if(is.na(LR)) LR <- l_rate[length(l_rate)]
  k_set_value(model$optimizer$lr, LR)
}
callback_lr_log <- function(batch, logs){
  lr_hist <<- c(lr_hist, k_get_value(model$optimizer$lr))
  iter_hist <<- c(iter_hist, k_get_value(model$optimizer$iterations))
}
These functions must be embedded into the callback_lambda() function, as you can see below.
callback_lr <- callback_lambda(on_train_begin=callback_lr_init, on_batch_begin=callback_lr_set)
callback_logger <- callback_lambda(on_batch_end=callback_lr_log)
Finding the best learning rate boundaries
The learning rate is one of the most important hyperparameters for training a neural network, so before starting a full training it is very important to know in which range the network converges and in which it diverges. To find the best learning rate boundaries you can follow the methodology of the paper, which is easy to apply and only requires training for a few epochs.
To do that, we will train the model for only five epochs (epochs_find_LR is set to 5 in this example), increasing the learning rate on each iteration up to a maximum value defined by lr_max. In the next code, the learning rate grows from (almost) 0 to lr_max following an exponential function.
## Varing LR
# we set low epochs
epochs_find_LR <- 5
# learning rate searcher
lr_max <- 0.1
n_iter <- ceiling(epochs_find_LR * (NROW(x_train)/batch_size))
growth_constant <- 15

# our learner will be an exponential function:
l_rate <- exp(seq(0, growth_constant, length.out=n_iter))
l_rate <- l_rate/max(l_rate)
l_rate <- l_rate * lr_max

plot(l_rate, type="b", pch=16, cex=0.1, xlab="iteration", ylab="learning rate")
plot(l_rate, type="b", log="y", pch=16, cex=0.1, xlab="iteration", ylab="learning rate (log scale)")
Next, we need to initialize the callback_log_acc_lr object and the model. After that, we will train the model for 5 epochs to find the learning rate boundaries.
Notice in the code below how the callback functions are passed using callbacks=list(callback_lr, callback_logger, callback_log_acc_lr); this is how we tell keras to execute them on each iteration of the training process.
callback_log_acc_lr <- LogMetrics$new()
model <- create_model()

# fit without data_augmentation ---------------------
history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs_find_LR,
  shuffle = TRUE,
  callbacks = list(callback_lr, callback_logger, callback_log_acc_lr),
  verbose = 2)
Once finished, we can plot the accuracy against the learning rate.
plot(lr_hist, callback_log_acc_lr$acc, log="x", type="b", pch=16, cex=0.3, xlab="learning rate (log scale)", ylab="accuracy")
For a better view, we can smooth the previous curve with a rolling average of 100 iterations and mark the learning rate boundaries. Within this range we expect the network to be able to increase its accuracy: as you can see, the network starts to learn at around 8e-6 and stops improving at around 2e-3.
So, adding a safety margin, we can set the learning rate boundaries to Learning_rate_l = 2e-5 (blue line) and Learning_rate_h = 8e-4 (red line).
Learning_rate_l <- 2e-5
Learning_rate_h <- 8e-4
plot(rollmean(lr_hist, 100), rollmean(callback_log_acc_lr$acc, 100), log="x", type="l", pch=16, cex=0.3, xlab="learning rate", ylab="accuracy: rollmean(100)")
abline(v=8e-6, col="grey60")
abline(v=2e-3, col="grey60")
abline(v=Learning_rate_l, col="blue")
abline(v=Learning_rate_h, col="red")
Training the model: low LR
Next, we will train the model over 75 epochs using a constant learning rate of Learning_rate_l = 2e-5, the lower boundary found before.
# Training ----------------------------------------------------------------
callback_log_acc_low <- LogMetrics$new()
model <- create_model(Learning_rate=Learning_rate_l, decay=0)

# fit without data_augmentation ---------------------
history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_data = list(x_test, y_test),
  shuffle = TRUE,
  callbacks = list(callback_log_acc_low),
  verbose = 2)
plot(history, theme_bw=getOption("keras.plot.history.theme_bw", TRUE))
history
## Trained on 50,000 samples, validated on 10,000 samples (batch_size=32, epochs=75)
## Final epoch (plot to see history):
##      acc: 0.7737
##     loss: 0.6852
##  val_acc: 0.7748
## val_loss: 0.6986
As you can see, the accuracy keeps increasing over the iterations.
plot(callback_log_acc_low$acc, type="l", cex=0.2, xlab="iteration", ylab="accuracy", col="blue")
plot(rollmean(callback_log_acc_low$acc, k=500), type="l", cex=0.2, xlab="iteration", ylab="accuracy: rollmean(500)", col="blue")
Training the model: high LR
Now we can train the model using the higher learning rate boundary, Learning_rate_h = 8e-4.
# Training ----------------------------------------------------------------
callback_log_acc_high <- LogMetrics$new()
model <- create_model(Learning_rate=Learning_rate_h, decay=0)

# fit without data_augmentation ---------------------
history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_data = list(x_test, y_test),
  shuffle = TRUE,
  callbacks = list(callback_log_acc_high),
  verbose = 2)
plot(history, theme_bw=getOption("keras.plot.history.theme_bw", TRUE))
history
## Trained on 50,000 samples, validated on 10,000 samples (batch_size=32, epochs=75)
## Final epoch (plot to see history):
##      acc: 0.8625
##     loss: 0.5927
##  val_acc: 0.8268
## val_loss: 0.7395
As you can see, the accuracy increases quickly and reaches a better performance at the end of training. This behaviour can differ with other network configurations or hyperparameters: sometimes, when training near the upper learning rate boundary, the accuracy first increases and then, at a given point, starts to fall. In that case a good strategy could be to stop the training when the accuracy changes its trend, for instance with the early-stopping sketch below.
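As a sketch of that strategy (not used in the trainings of this post), keras' predefined callback_early_stopping() can stop the run automatically once a monitored metric stops improving; the patience value below is an assumption, not a recommendation from the paper.

# Hedged sketch: stop the high-LR training when validation accuracy stops improving,
# instead of always completing the 75 epochs. The patience of 5 epochs is an assumption.
cb_stop_on_acc <- callback_early_stopping(monitor = "val_acc", mode = "max", patience = 5)
# It would be added to the fit() call via:
#   callbacks = list(callback_log_acc_high, cb_stop_on_acc)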
plot(callback_log_acc_high$acc, type="l", cex=0.2, xlab="iteration", ylab="accuracy", col="red")
plot(rollmean(callback_log_acc_high$acc, k=500), type="l", cex=0.2, xlab="iteration", ylab="accuracy: rollmean(500)", col="red")
Training the model: high LR (with decay)
Now let's see how the model performs using the high learning rate boundary together with a decay. For that, we add the argument decay=1e-3 and then retrain the model.
# Training ----------------------------------------------------------------
callback_log_acc_high_decay <- LogMetrics$new()
model <- create_model(Learning_rate=Learning_rate_h, decay=1e-3)

# fit without data_augmentation ---------------------
history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_data = list(x_test, y_test),
  shuffle = TRUE,
  callbacks = list(callback_log_acc_high_decay),
  verbose = 2)
plot(history, theme_bw=getOption("keras.plot.history.theme_bw", TRUE))
history
## Trained on 50,000 samples, validated on 10,000 samples (batch_size=32, epochs=75)
## Final epoch (plot to see history):
##      acc: 0.819
##     loss: 0.565
##  val_acc: 0.8138
## val_loss: 0.5926
The accuracy never reaches the maximum achieved without decay. We could try to fine-tune the decay value to maximize the final accuracy, but doing so by trial and error is computationally expensive; a cheaper first check is sketched after the plot below.
plot(rollmean(callback_log_acc_high_decay$acc, k=500), type="l", cex=0.2, xlab="iteration", ylab="accuracy: rollmean(500)", col="orange")
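Rather than retraining for every candidate decay, a cheaper first step is to plot the effective learning rate over the planned number of iterations. The sketch below assumes Keras' time-based decay rule, lr_t = lr / (1 + decay * t), and the candidate decay values are arbitrary.

# Effective learning rate under time-based decay (assumed rule: lr / (1 + decay * t))
# for a few arbitrary candidate decay values, over the full 75-epoch run.
n_iter_full <- ceiling(epochs * (NROW(x_train)/batch_size))   # ~117,188 iterations
t <- 1:n_iter_full
plot(t, Learning_rate_h / (1 + 1e-3 * t), type="l", log="y",
     xlab="iteration", ylab="effective learning rate (log scale)")
lines(t, Learning_rate_h / (1 + 1e-4 * t), lty=2)   # smaller candidate decay
lines(t, Learning_rate_h / (1 + 1e-2 * t), lty=3)   # larger candidate decay
legend("topright", legend=c("decay=1e-3", "decay=1e-4", "decay=1e-2"), lty=1:3)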
In the next part, we will show how cyclic learning rates can achieve better results with less computation.
Cyclical Learning Rate function
To build the cyclic learning rate we define a function called Cyclic_LR, translated from Python to R (the original is linked in the code comments). It returns a vector with the learning rate value for each iteration, which is then consumed by the callback functions defined earlier.
####################
Cyclic_LR <- function(iteration=1:32000, base_lr=1e-5, max_lr=1e-3, step_size=2000,
                      mode='triangular', gamma=1, scale_fn=NULL, scale_mode='cycle'){
  # translated from python to R, original at: https://github.com/bckenstler/CLR/blob/master/clr_callback.py
  # This callback implements a cyclical learning rate policy (CLR).
  # The method cycles the learning rate between two boundaries with
  # some constant frequency, as detailed in this paper (https://arxiv.org/abs/1506.01186).
  # The amplitude of the cycle can be scaled on a per-iteration or per-cycle basis.
  # This class has three built-in policies, as put forth in the paper.
  # - "triangular": A basic triangular cycle w/ no amplitude scaling.
  # - "triangular2": A basic triangular cycle that scales initial amplitude by half each cycle.
  # - "exp_range": A cycle that scales initial amplitude by gamma**(cycle iterations) at each cycle iteration.
  # - "sinus": A sinusoidal form cycle
  #
  # Example
  #   > clr <- Cyclic_LR(base_lr=0.001, max_lr=0.006, step_size=2000, mode='triangular', num_iterations=20000)
  #   > plot(clr, cex=0.2)
  # Class also supports custom scaling functions with function output max value of 1:
  #   > clr_fn <- function(x) 1/x
  #   > clr <- Cyclic_LR(base_lr=0.001, max_lr=0.006, step_size=400,
  #                      scale_fn=clr_fn, scale_mode='cycle', num_iterations=20000)
  #   > plot(clr, cex=0.2)
  #
  # Arguments
  #   iteration:
  #       if is a number:
  #           id of the iteration where: max iteration = epochs * (samples/batch)
  #       if "iteration" is a vector i.e.: iteration=1:10000:
  #           returns the whole sequence of lr as a vector
  #   base_lr: initial learning rate which is the
  #       lower boundary in the cycle.
  #   max_lr: upper boundary in the cycle. Functionally,
  #       it defines the cycle amplitude (max_lr - base_lr).
  #       The lr at any cycle is the sum of base_lr
  #       and some scaling of the amplitude; therefore
  #       max_lr may not actually be reached depending on
  #       scaling function.
  #   step_size: number of training iterations per
  #       half cycle. Authors suggest setting step_size
  #       2-8 x training iterations in epoch.
  #   mode: one of {triangular, triangular2, exp_range, sinus}.
  #       Default 'triangular'.
  #       Values correspond to policies detailed above.
  #       If scale_fn is not None, this argument is ignored.
  #   gamma: constant in 'exp_range' scaling function:
  #       gamma**(cycle iterations)
  #   scale_fn: Custom scaling policy defined by a single
  #       argument lambda function, where
  #       0 <= scale_fn(x) <= 1 for all x >= 0.
  #       mode paramater is ignored
  #   scale_mode: {'cycle', 'iterations'}.
  #       Defines whether scale_fn is evaluated on
  #       cycle number or cycle iterations (training
  #       iterations since start of cycle). Default is 'cycle'.
  ########
  if(is.null(scale_fn)==TRUE){
    if(mode=='triangular'){scale_fn <- function(x) 1; scale_mode <- 'cycle';}
    if(mode=='triangular2'){scale_fn <- function(x) 1/(2^(x-1)); scale_mode <- 'cycle';}
    if(mode=='exp_range'){scale_fn <- function(x) gamma^(x); scale_mode <- 'iterations';}
    if(mode=='sinus'){scale_fn <- function(x) 0.5*(1+sin(x*pi/2)); scale_mode <- 'cycle';}
  }
  lr <- list()
  if(is.vector(iteration)==TRUE){
    for(iter in iteration){
      cycle <- floor(1 + (iter / (2*step_size)))
      x2 <- abs(iter/step_size - 2*cycle + 1)
      if(scale_mode=='cycle') x <- cycle
      if(scale_mode=='iterations') x <- iter
      lr[[iter]] <- base_lr + (max_lr-base_lr) * max(0,(1-x2)) * scale_fn(x)
    }
  }
  lr <- do.call("rbind", lr)
  return(as.vector(lr))
}
####################
Below we plot the output vector of the Cyclic_LR function using mode='triangular'. The step_size (number of training iterations per half cycle) is set to floor(n_iter/75), which corresponds to one epoch per half cycle.
n_iter <- ceiling(epochs * (NROW(x_train)/batch_size))
l_rate <- Cyclic_LR(iteration=1:n_iter, base_lr=Learning_rate_l, max_lr=Learning_rate_h,
                    step_size=floor(n_iter/75), mode='triangular', gamma=1, scale_fn=NULL, scale_mode='cycle')

plot(l_rate, type="b", pch=16, xlab="iteration", cex=0.2, ylab="learning rate", col="grey50")
Training the model: Cyclical Learning Rate
Next, we will repeat our training using the triangular cyclic learning rate shown above.
callback_log_acc_clr <- LogMetrics$new()
model <- create_model()

# fit without data_augmentation ---------------------
history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_data = list(x_test, y_test),
  shuffle = TRUE,
  callbacks = list(callback_lr, callback_logger, callback_log_acc_clr),
  verbose = 2)
plot(history, theme_bw=getOption("keras.plot.history.theme_bw", TRUE))
history
## Trained on 50,000 samples, validated on 10,000 samples (batch_size=32, epochs=75)
## Final epoch (plot to see history):
##      acc: 0.8793
##     loss: 0.5059
##  val_acc: 0.8406
## val_loss: 0.665
In the plot below we add the cyclic accuracy curve to the accuracy curves obtained before. As we can see, the cyclic methodology achieves the highest accuracy, so it can be an interesting tool to apply in the training of neural networks.
plot(rollmean(callback_log_acc_clr$acc, 500), col="grey50", type="l", cex=0.2, xlab="iteration", ylab="accuracy: rollmean(500)", ylim=c(0,1))
lines(rollmean(callback_log_acc_high$acc, 500), col="red")
lines(rollmean(callback_log_acc_high_decay$acc, 500), col="orange")
lines(rollmean(callback_log_acc_low$acc, 500), col="blue")
Next, we will add a decay to the cyclical learning rate to see how the network performs.
Training the model: Cyclical Learning Rate (with decay)
Here we add a decay to our triangular cyclical learning rate. A suitable decay value can be found easily by plotting the learning rate over all the expected iterations and adjusting it so that the oscillation between the boundaries becomes narrower towards the final part of the training.
The next figure shows the result of a decay value of gamma=0.99997 with the mode='exp_range' argument. The step_size (number of training iterations per half cycle) is again set to floor(n_iter/75).
n_iter <- ceiling(epochs * (NROW(x_train)/batch_size))
l_rate <- Cyclic_LR(iteration=1:n_iter, base_lr=Learning_rate_l, max_lr=Learning_rate_h,
                    step_size=floor(n_iter/75), mode='exp_range', gamma=0.99997, scale_fn=NULL, scale_mode='cycle')

plot(l_rate, type="b", pch=16, xlab="iter", cex=0.2, ylab="learning rate", col="black")
callback_log_acc_clr2 <- LogMetrics$new()
model <- create_model()

# fit without data_augmentation ---------------------
history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_data = list(x_test, y_test),
  shuffle = TRUE,
  callbacks = list(callback_lr, callback_logger, callback_log_acc_clr2),
  verbose = 2)
plot(history, theme_bw=getOption("keras.plot.history.theme_bw", TRUE))
history
## Trained on 50,000 samples, validated on 10,000 samples (batch_size=32, epochs=75)
## Final epoch (plot to see history):
##      acc: 0.9036
##     loss: 0.345
##  val_acc: 0.8584
## val_loss: 0.5155
Training the model this way we obtain the best accuracy of all the alternatives tried before, and this highest accuracy is reached in fewer epochs.
plot(rollmean(callback_log_acc_clr$acc, 500), col="grey50", type="l", cex=0.2, xlab="iteration", ylab="accuracy: rollmean(500)", ylim=c(0,1))
lines(rollmean(callback_log_acc_high$acc, 500), col="red")
lines(rollmean(callback_log_acc_high_decay$acc, 500), col="orange")
lines(rollmean(callback_log_acc_low$acc, 500), col="blue")
lines(rollmean(callback_log_acc_clr2$acc, 500), col="black")
Confusion matrix
Below we can check the confusion matrix of the Cyclical Learning Rate (with decay) model on the test dataset.
# Predict the classes for the test data
classes_pred <- model %>% predict_classes(x_test)
classes_pred <- categ[as.vector(classes_pred)+1]
classes_test <- categ[apply(y_test, 1, which.max)]
table(classes_pred, classes_test)

##             classes_test
## classes_pred auto bird cat deer dog frog horse plane ship truck
##        auto   927    0   2    1   1    0     0     8    9    32
##        bird     0  760  42   26  25   11    11    26    4     2
##        cat      2   22 657   13  91   19    24    10    2     5
##        deer     2   57  58  887  32   16    41    12    3     1
##        dog      1   39 138   16 808    4    27     1    0     0
##        frog     3   64  71   32  18  940    15     8    8     5
##        horse    0   10  10   19  15    3   869     5    0     3
##        plane    5   45  10    4   5    2     7   878   24    12
##        ship    15    1   7    2   2    4     3    38  937    19
##        truck   45    2   5    0   3    1     3    14   13   921
Conclusion
Cyclical Learning Rates are a very good technique for training a neural network efficiently while also reaching the maximum accuracy (or minimum loss), at least as far as we have checked using the CIFAR10 dataset.
I also showed that it is worth spending some time at the start of the training workflow to find the best learning rate boundaries, in order to save time and computational power during training.
Additional notes: if we execute the code with a batch size of 128, the execution time is reduced considerably (around 66%) thanks to the vectorization, although the performance is slightly lower. Even so, according to the tests I have done, there is no remarkable improvement from the Cyclic Learning Rate for this specific example and dataset when using batch size 128. This is most likely because with batch size 128 the number of iterations is much lower (29,297), whereas with batch size 32 four times as many iterations are performed (117,188).
In conclusion, the Cyclic Learning Rate works better with a large number of iterations. In addition, for any batch size, all cases demonstrated the usefulness of finding the best learning rate boundaries and of training with the highest learning rate of that range.
Session Info:
------------------------------------
Total R execution time: 6.2 hours
------------------------------------
 setting  value
 version  R version 3.5.1 (2018-07-02)
 os       Ubuntu 16.04.5 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Etc/UTC
 date     2019-01-01
------------------------------------
Packages:
 [1] "bitops - 1.0-6 - 2013-08-17 - CRAN (R 3.5.1)"
 [2] "imager - 0.41.1 - 2018-05-30 - CRAN (R 3.5.1)"
 [3] "keras - 2.2.0 - 2018-08-24 - CRAN (R 3.5.1)"
 [4] "knitr - 1.20 - 2018-02-20 - CRAN (R 3.5.1)"
 [5] "magrittr - 1.5 - 2014-11-22 - CRAN (R 3.5.1)"
 [6] "RCurl - 1.95-4.11 - 2018-07-15 - CRAN (R 3.5.1)"
 [7] "reshape2 - 1.4.3 - 2017-12-11 - CRAN (R 3.5.1)"
 [8] "RWordPress - 0.2-3 - 2018-11-07 - Github (duncantl/RWordPress@ce6d2d6)"
 [9] "sessioninfo - 1.1.1 - 2018-11-05 - CRAN (R 3.5.1)"
[10] "stringr - 1.3.1 - 2018-05-10 - CRAN (R 3.5.1)"
[11] "XMLRPC - 0.3-1 - 2018-11-07 - Github (duncantl/XMLRPC@add9496)"
[12] "zoo - 1.8-4 - 2018-09-19 - CRAN (R 3.5.1)"