In this post, we will put to the test the technique developed in 2015 and described in the paper Cyclical Learning Rates for Training Neural Networks. This technique proposes a new method for changing the learning rate on each iteration (or batch) in order to get the best performance out of the training process, achieving the maximum accuracy with the minimum number of epochs and consequently saving time.

The method described is called “training with Cyclical Learning Rates”. The aim of this methodology is to train the neural network with a learning rate that changes cyclically on each step or mini-batch, instead of a non-cyclic learning rate that is either a constant value or decays once per epoch.

We will also see how to determine the “reasonable bounds” between which the cyclic learning rate will oscillate. This technique is also explained in this other post, and because the learning rate is perhaps the most important hyperparameter for training neural networks, setting up the learning rate limits for each network configuration can be a critical step before training starts.

We will use the keras package for R to train the model, with TensorFlow as the backend. We will train it from scratch instead of using a pretrained model; I think this is the best option for this post in order to see how Cyclical Learning Rates perform. Also, because I want to focus on the cyclic learning rate methodology, we will not use data augmentation when training the model. This lets us concentrate on the code specific to this purpose instead of trying to achieve the maximum possible accuracy of the model.

It’s highly recommended to run this example on a GPU. I ran it on AWS (Amazon Web Services) cloud computing servers using a p2.xlarge instance.

First of all, you need the keras library on your system; if you need to install it, go to Install Keras and the TensorFlow backend.

Loading the dataset

For this example we will use the CIFAR10 small images dataset, which consists of 60000 colour images of 32×32 pixels (50000 training images and 10000 test images). The images are classified into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.

We will load the CIFAR10 dataset into memory; because it is included in the keras package, we can load it easily using dataset_cifar10().

Next, we need to split the dataset into train and test sets. The ‘x’ arrays (x_train, x_test) contain the images (three matrices per image, one for each RGB colour component: red, green and blue); we need to normalize the image data by dividing by 255. The ‘y’ matrices (y_train, y_test) contain the labels of the data in a one-hot encoded structure.
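A minimal sketch of this step with the keras R API could look like the following:

library(keras)

cifar10 <- dataset_cifar10()

x_train <- cifar10$train$x / 255                              # scale pixel values to [0, 1]
x_test  <- cifar10$test$x  / 255
y_train <- to_categorical(cifar10$train$y, num_classes = 10)  # one-hot encode the labels
y_test  <- to_categorical(cifar10$test$y,  num_classes = 10)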

The code of each ‘y’ label of the dataset refers to an animal or a vehicle according to the following order.
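For reference (and for later use when plotting images and building the confusion matrix), the labels 0 to 9 correspond to the following class names:

class_names <- c("airplane", "automobile", "bird", "cat", "deer",
                 "dog", "frog", "horse", "ship", "truck")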

We set the batch size to 32 and the number of epochs to 75 as shown below.
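batch_size <- 32
epochs     <- 75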

With the next piece of code you can get a better idea of the dataset by having a look at some of the pictures. We can plot the pictures using the imager package, which is based on CImg.
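A minimal sketch of this step could be the following (the to_cimg helper and the choice of showing the first eight training images are assumptions, not part of the original code):

library(imager)

# helper: turn one CIFAR10 image (32 x 32 x 3 array, values in [0, 1]) into a cimg object
to_cimg <- function(img) {
  as.cimg(array(aperm(img, c(2, 1, 3)), dim = c(32, 32, 1, 3)))
}

par(mfrow = c(2, 4), mar = c(0.5, 0.5, 1.5, 0.5))
for (i in 1:8) {
  plot(to_cimg(x_train[i, , , ]))
  title(class_names[which.max(y_train[i, ])], line = 0.5)
}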

[Figure: a sample of CIFAR10 training images with their class names]

Neural network configuration

Because we want to check the performance of the same neural network using different learning rate configurations, below we define a function that creates the same neural network each time it is called. Notice the k_clear_session() and use_session_with_seed() calls, which give us the same starting point for each new training run and make the results reproducible.

The network is based on three pairs of convolutional layers interleaved with max pooling, followed by two dense layers at the end.
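A sketch of such a function is shown below; the filter sizes, the width of the dense layer and the Adam optimizer are assumptions (the original configuration may differ), while the session handling follows the two calls mentioned above.

make_model <- function(seed = 42) {
  # start from a clean graph and a fixed seed so every run is comparable
  k_clear_session()
  use_session_with_seed(seed, disable_gpu = FALSE, disable_parallel_cpu = FALSE)

  model <- keras_model_sequential() %>%
    # pair 1
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same",
                  activation = "relu", input_shape = c(32, 32, 3)) %>%
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu") %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    # pair 2
    layer_conv_2d(filters = 64, kernel_size = c(3, 3), padding = "same",
                  activation = "relu") %>%
    layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    # pair 3
    layer_conv_2d(filters = 128, kernel_size = c(3, 3), padding = "same",
                  activation = "relu") %>%
    layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = "relu") %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    # two dense layers at the end
    layer_flatten() %>%
    layer_dense(units = 512, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

  model %>% compile(
    loss      = "categorical_crossentropy",
    optimizer = optimizer_adam(),   # assumed optimizer; the learning rate is set later by the callbacks
    metrics   = "accuracy"
  )
  model
}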

Let’s summarise the structure:
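model <- make_model()
summary(model)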

Callback functions

For our cyclic learning rate example, we also need to set up some specific callback functions. Callbacks are functions that can be run during the training process in order to do different things, such as saving the model weights after an epoch, changing hyperparameters, or writing log files. There are some predefined callback functions in keras, but you can also create your own custom callbacks.

Callback for logging metrics on each iteration

Because in a normal training process the keras framework gives us the training metrics (accuracy, loss) only at the end of each epoch, we need to create a function for getting them on each ‘iteration’ or ‘batch’. Remember that one epoch has ‘n’ iterations, and each iteration uses a ‘batch’ of ‘m’ images.

The next function is an R6 class based on the KerasCallback class and will log the accuracy and loss into the LogMetrics object at the end of each batch. This is useful because, by implementing this piece of code, you will have more control over how the cyclic learning rate works on each iteration.
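A sketch of this callback, following the custom-callback pattern from the keras R documentation, could be (note that the accuracy metric is reported as "acc" in older keras versions and "accuracy" in newer ones):

LogMetrics <- R6::R6Class("LogMetrics",
  inherit = KerasCallback,
  public = list(
    loss = NULL,
    acc  = NULL,
    on_batch_end = function(batch, logs = list()) {
      # append the metrics of the batch that has just finished
      self$loss <- c(self$loss, logs[["loss"]])
      self$acc  <- c(self$acc,  logs[["acc"]])
    }
  )
)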

Callbacks for changing the learning rate on each iteration

We also need to define three more callback functions:

  • callback_lr_init: sets the iteration counter to zero and clears the learning rate history lr_hist and the iteration history iter_hist
  • callback_lr_set: sets the learning rate according to the l_rate vector on each iteration
  • callback_lr_log: logs the learning rate value and the iteration number into the lr_hist and iter_hist objects

These functions must be embedded into the callback_lambda() function as you can see below.
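A sketch of these three functions and of how they are wrapped into callback_lambda() could look like this (the global objects iter, lr_hist and iter_hist, as well as the l_rate vector built later, are assumptions about how the pieces fit together; the learning rate is read and written with k_get_value() and k_set_value()):

iter      <- 0     # iteration counter
lr_hist   <- c()   # learning rate history
iter_hist <- c()   # iteration history

callback_lr_init <- function(logs) {
  iter      <<- 0
  lr_hist   <<- c()
  iter_hist <<- c()
}

callback_lr_set <- function(batch, logs) {
  iter <<- iter + 1
  lr <- l_rate[iter]     # l_rate: precomputed vector with one learning rate per iteration
  if (!is.na(lr)) k_set_value(model$optimizer$lr, lr)
}

callback_lr_log <- function(batch, logs) {
  lr_hist   <<- c(lr_hist, k_get_value(model$optimizer$lr))
  iter_hist <<- c(iter_hist, iter)
}

callback_lr     <- callback_lambda(on_train_begin = callback_lr_init,
                                   on_batch_begin = callback_lr_set)
callback_logger <- callback_lambda(on_batch_end   = callback_lr_log)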

Finding the best learning rates boundaries

The learning rate is one of the most important hyperparameters for training a neural network, so it’s very important to know, before starting a full training process, in which range the network converges and in which it diverges. In order to find the best learning rate boundaries you can follow the methodology of this paper, which is very easy to apply and only requires spending a few epochs.

To do that, we will train the model for only five epochs (epochs_find_LR is set to 5 in this example), increasing the learning rate on each iteration until the maximum allowed value, defined by lr_max, is reached.

In the next code, the learning rate will be increased from 0 to lr_max using an exponential function.
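The exact schedule of the original chunk is not shown here, so the values below (lr_max and the growth constant) are only an assumption of how such an exponential ramp could be built:

epochs_find_LR <- 5
lr_max <- 0.1     # assumed ceiling for the range test

n_iter <- ceiling(nrow(x_train) / batch_size) * epochs_find_LR

# learning rate grows exponentially from (almost) 0 up to lr_max
growth_constant <- 15
l_rate <- exp(seq(0, growth_constant, length.out = n_iter))
l_rate <- l_rate / max(l_rate) * lr_max

plot(l_rate, type = "l", xlab = "iteration", ylab = "learning rate")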

[Figure: learning rate increasing exponentially from ~0 to lr_max over the range-test iterations]

Next, we need to initialise the logging callback callback_log_acc and the model. After that, we will train the model for 5 epochs to find the learning rate boundaries.

Notice in the code below how the callback functions are passed using callbacks=list(callback_lr, callback_logger, callback_log_acc). This way we are telling keras to execute the callback functions on each iteration of the training process.
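Put together, the range-test training call could look like the following sketch (make_model() is the helper defined earlier; the verbose setting is an assumption):

callback_log_acc <- LogMetrics$new()
model <- make_model()

history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs     = epochs_find_LR,
  shuffle    = TRUE,
  callbacks  = list(callback_lr, callback_logger, callback_log_acc),
  verbose    = 2
)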

After it has finished, we can plot the accuracy against the learning rate.

[Figure: accuracy versus learning rate during the range test]

For a better understanding, we can smooth the previous curve using a rolling average over 100 iterations and add the learning rate boundaries. Within this range we expect our network to be able to increase its accuracy; as you can see, the network starts to learn at around 8e-6 and stops improving at around 2e-3.

So, adding a safety margin, we can set the learning rate boundaries to Learning_rate_l = 2e-5 (blue line) and Learning_rate_h = 8e-4 (red line).
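A sketch of this smoothing step, using zoo::rollmean for the rolling average, could be:

library(zoo)

Learning_rate_l <- 2e-5   # lower boundary (blue line)
Learning_rate_h <- 8e-4   # upper boundary (red line)

acc_smooth <- rollmean(callback_log_acc$acc, k = 100, fill = NA)

plot(lr_hist, acc_smooth, log = "x", type = "l",
     xlab = "learning rate (log scale)", ylab = "accuracy (rolling mean over 100 iterations)")
abline(v = Learning_rate_l, col = "blue")
abline(v = Learning_rate_h, col = "red")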

[Figure: smoothed accuracy versus learning rate with the selected boundaries]

Training the model: low LR

Next, we will train the model for 75 epochs using a constant learning rate of Learning_rate_l = 2e-5, which corresponds to the lower boundary found before.
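Reusing the callback machinery from above, a sketch of this run could be (callback_log_acc_low is a hypothetical name for a fresh LogMetrics instance; the validation and verbose settings are assumptions):

n_iter <- ceiling(nrow(x_train) / batch_size) * epochs
l_rate <- rep(Learning_rate_l, n_iter)    # constant schedule, one value per iteration

callback_log_acc_low <- LogMetrics$new()
model <- make_model()

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs     = epochs,
  validation_data = list(x_test, y_test),
  callbacks  = list(callback_lr, callback_logger, callback_log_acc_low),
  verbose    = 2
)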

[Figure: accuracy per iteration with the constant low learning rate]

As you can see, the accuracy keeps increasing over the iterations.

[Figures: additional training plots for the constant low learning rate run]

Training the model: high LR

Now we can train the model using the higher learning rate boundary of Learning_rate_h = 8e-4.

[Figure: accuracy per iteration with the constant high learning rate]

As you can see, the accuracy increases quickly and achieves a better performance at the end of the training. This behaviour can be different with other network configurations or hyperparameters. Sometimes, when training in the higher zone of the learning rate boundaries, the model starts increasing the accuracy and then, at a given point, the accuracy drops. In that case it could be a good strategy to stop the training when the accuracy changes its trend.

[Figures: additional training plots for the constant high learning rate run]

Training the model: high LR (with decay)

Now let’s see how the model performs using the high boundary learning rate with a decay value. For that, we can add the argument decay=1e-4 and then retrain the model.
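In the keras R API the decay is an argument of the optimizer, so a sketch of this run recompiles the model with it (optimizer_adam is still an assumption; the original post may use a different optimizer):

model <- make_model()
model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = optimizer_adam(lr = Learning_rate_h, decay = 1e-4),  # learning rate decays on every update
  metrics   = "accuracy"
)

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs     = epochs,
  validation_data = list(x_test, y_test),
  verbose    = 2
)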

[Figure: accuracy per iteration with the high learning rate and decay]

The accuracy never reaches the maximum accuracy achieved without decay. We could try to tune the decay value in order to maximise the final accuracy, but doing so by trial and error is computationally expensive.

[Figure: training results for the high learning rate with decay]

In the next part, we will show how, using cyclic learning rates, we can achieve better results with less computation.

Cyclical Learning Rate function

In order to use a cyclic learning rate we need to define a function called Cyclic_LR, which has been translated from Python to R (see here). This function returns a vector with the learning rate value for each iteration; this output vector will be used by the previously defined callback functions.

Below you can see the output vector of the Cyclic_LR function using the mode='triangular' argument.
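The linked translation is not reproduced here; a compact sketch that behaves in the same way (one learning rate per iteration, following the formulation of the paper) could be the following, where the step_size value of floor(n_iter/75) is the one mentioned later for the exp_range figure and is assumed for the triangular plot as well:

Cyclic_LR <- function(iteration, base_lr = 2e-5, max_lr = 8e-4,
                      step_size = 2000, mode = "triangular", gamma = 1) {
  # iteration: vector 1, 2, ..., n_iter; returns one learning rate per iteration
  cycle <- floor(1 + iteration / (2 * step_size))
  x     <- abs(iteration / step_size - 2 * cycle + 1)
  scale <- switch(mode,
                  triangular  = 1,
                  triangular2 = 1 / (2^(cycle - 1)),
                  exp_range   = gamma^iteration)
  base_lr + (max_lr - base_lr) * pmax(0, 1 - x) * scale
}

n_iter <- ceiling(nrow(x_train) / batch_size) * epochs
l_rate <- Cyclic_LR(iteration = 1:n_iter,
                    base_lr   = Learning_rate_l,
                    max_lr    = Learning_rate_h,
                    step_size = floor(n_iter / 75),
                    mode      = "triangular")

plot(l_rate, type = "l", xlab = "iteration", ylab = "learning rate")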


[Figure: triangular cyclical learning rate over the training iterations]

Training the model: Cyclical Learning Rate

Next, we will repeat the training using the triangular cyclic learning rate shown above.

[Figure: accuracy per iteration with the triangular cyclical learning rate]

In the plot below we add the cyclic accuracy curve to the accuracy curves that we got before. As we can see, the cyclic methodology achieves the highest accuracy, so it can be an interesting tool to apply in the training process of neural networks.

[Figure: accuracy comparison of the constant and cyclical learning rate runs]

Next we will add a decay to the cyclical learning rate in order to see how the network performs.

Training the model: Cyclical Learning Rate (with decay)

Here, we will add a decay value to our triangular cyclical learning rate. A suitable decay can easily be found by plotting the learning rate over all the expected iterations and adjusting it so that the cycles have a smaller amplitude towards the final part of the training.

The next figure is the result of a decay value of gamma=0.99997 using the mode='exp_range' argument. The step_size (number of training iterations per half cycle) will be set to floor(n_iter/75).
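With the Cyclic_LR sketch from above, this schedule could be generated like this:

l_rate <- Cyclic_LR(iteration = 1:n_iter,
                    base_lr   = Learning_rate_l,
                    max_lr    = Learning_rate_h,
                    step_size = floor(n_iter / 75),
                    mode      = "exp_range", gamma = 0.99997)

plot(l_rate, type = "l", xlab = "iteration", ylab = "learning rate")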

[Figure: cyclical learning rate with exponential decay (mode='exp_range', gamma=0.99997)]

[Figure: accuracy per iteration with the decayed cyclical learning rate]

Training the model this way, we obtain the best accuracy of all the alternatives tried before, and this highest accuracy has been achieved with fewer epochs.

[Figure: accuracy comparison including the cyclical learning rate with decay]

Confusion matrix

Below we can check the confusion matrix of the Cyclical Learning Rate (with decay) model on the test dataset.
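A simple way to build it with base R could be the following sketch (class_names comes from the data-loading section; max.col decodes the one-hot rows):

probs  <- model %>% predict(x_test)
y_pred <- max.col(probs)     # predicted class index (1-10)
y_true <- max.col(y_test)    # true class index decoded from the one-hot labels

table(Predicted = class_names[y_pred], Actual = class_names[y_true])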

Conclusion

As far as we have checked using the CIFAR10 dataset, training with Cyclical Learning Rates is a very good technique for training a neural network in an efficient way while also achieving the maximum accuracy (or the minimum loss).

I also showed that it is worth spending some time at the start of the training workflow finding the best learning rate boundaries, in order to save time and computational power during training.

Additional notes: if we execute the code with a batch size of 128, a slightly lower performance is obtained, but the execution time, thanks to the vectorization, is reduced considerably (by around 66%). Even so, according to the tests I have done, there is no remarkable improvement from using the Cyclic Learning Rate for this specific example and dataset with batch size 128. This is surely due to the fact that with batch size 128 the number of iterations is much lower (29,297 iterations), whereas with batch size 32 four times as many were done (117,188 iterations).

In conclusion, it can be said that the Cyclic Learning Rate works better with a large number of iterations. In addition, for any batch size, all the cases demonstrate the usefulness of finding the best learning rate boundaries and of training with the highest usable learning rate.


Session Info:

Appendix, all the code:

