In this post, we will put to the test the technique developed in 2015 and described in the paper Cyclical Learning Rates for Training Neural Networks. This technique proposes a new method for changing the learning rate on each iteration (or batch) in order to get the best performance out of the training process, achieving the maximum accuracy with the minimum number of epochs and consequently saving time.

The method described is called “training with Cyclical Learning Rates”. The aim of this methodology is to train the neural network with a learning rate that changes cyclically on each step or mini-batch, instead of a non-cyclic learning rate that is either a constant value or decays once per epoch.

We will also see how to determine the “reasonable bounds” between which the cyclic learning rate will oscillate. This technique is also explained in this other post, and because the learning rate is perhaps the most important hyperparameter for training neural networks, setting up the learning rate limits for each network configuration can be a critical step before training starts.

We will use the keras package for R to train the model, with TensorFlow as the backend. We will train it from scratch instead of using a pretrained model; I think this is the best option for this post in order to see how Cyclical Learning Rates perform. Also, because I want to focus on the cyclic learning rate methodology, we will not use data augmentation when training the model. This lets us concentrate on the code specific to this purpose instead of trying to achieve the maximum possible accuracy of the model.

It’s highly recommended to run this example on a GPU. I ran it on AWS (Amazon Web Services) cloud computing servers using a p2.xlarge instance.

First of all, you need the keras library on your system; if you need to install it, go to Install Keras and the TensorFlow backend.

Loading the dataset

For this example we will use the CIFAR10 small images dataset, which consists of 60000 colour images of 32×32 pixels (50000 training images and 10000 test images). The images are classified into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.

We will load the CIFAR10 dataset into memory; because it is included in the keras package, we can load it easily using dataset_cifar10().

Next, we need to split the dataset into train and test sets. The ‘x’ arrays (x_train, x_test) contain the images (three matrices per image, one for each RGB colour component: red, green and blue); we need to normalize the image data by dividing by 255. The ‘y’ matrices (y_train, y_test) contain the labels of the data in a one-hot encoded structure.
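A minimal sketch of this step with the keras R API could look like the following:

library(keras)

cifar10 <- dataset_cifar10()

x_train <- cifar10$train$x / 255                              # scale pixel values to [0, 1]
x_test  <- cifar10$test$x  / 255
y_train <- to_categorical(cifar10$train$y, num_classes = 10)  # one-hot encode the labels
y_test  <- to_categorical(cifar10$test$y,  num_classes = 10)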

The code of each ‘y’ label of the dataset refers to an animal or a vehicle according to the following order.
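For reference (and for later use when plotting images and building the confusion matrix), the labels 0 to 9 correspond to the following class names:

class_names <- c("airplane", "automobile", "bird", "cat", "deer",
                 "dog", "frog", "horse", "ship", "truck")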

We set the batch size to 32 and the number of epochs to 75 as shown below.
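batch_size <- 32
epochs     <- 75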

With the next piece of code you can get a better idea of the dataset by having a look at some of the pictures. We can plot the pictures using the imager package, which is based on CImg.
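A minimal sketch of this step could be the following (the to_cimg helper and the choice of showing the first eight training images are assumptions, not part of the original code):

library(imager)

# helper: turn one CIFAR10 image (32 x 32 x 3 array, values in [0, 1]) into a cimg object
to_cimg <- function(img) {
  as.cimg(array(aperm(img, c(2, 1, 3)), dim = c(32, 32, 1, 3)))
}

par(mfrow = c(2, 4), mar = c(0.5, 0.5, 1.5, 0.5))
for (i in 1:8) {
  plot(to_cimg(x_train[i, , , ]))
  title(class_names[which.max(y_train[i, ])], line = 0.5)
}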

[Figure: a sample of CIFAR10 training images with their class names]

Neural network configuration

Because we want to check the performance of the same neural network using different learning rate configurations, below we define a function that creates the same neural network each time it is called. Notice the k_clear_session() and use_session_with_seed() calls, which give us the same starting point for each new training run and make the results reproducible.

The network is based on three pairs of convolutional layers interleaved with max pooling, followed by two dense layers at the end.
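A sketch of such a function is shown below; the filter sizes, the width of the dense layer and the Adam optimizer are assumptions (the original configuration may differ), while the session handling follows the two calls mentioned above.

make_model <- function(seed = 42) {
  # start from a clean graph and a fixed seed so every run is comparable
  k_clear_session()
  use_session_with_seed(seed, disable_gpu = FALSE, disable_parallel_cpu = FALSE)

  model <- keras_model_sequential() %>%
    # pair 1
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same",
                  activation = "relu", input_shape = c(32, 32, 3)) %>%
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu") %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    # pair 2
    layer_conv_2d(filters = 64, kernel_size = c(3, 3), padding = "same",
                  activation = "relu") %>%
    layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    # pair 3
    layer_conv_2d(filters = 128, kernel_size = c(3, 3), padding = "same",
                  activation = "relu") %>%
    layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = "relu") %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    # two dense layers at the end
    layer_flatten() %>%
    layer_dense(units = 512, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

  model %>% compile(
    loss      = "categorical_crossentropy",
    optimizer = optimizer_adam(),   # assumed optimizer; the learning rate is set later by the callbacks
    metrics   = "accuracy"
  )
  model
}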

Let’s summarise the structure:
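model <- make_model()
summary(model)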

Callback functions

For our cyclic learning rate example, we also need to set up some specific callback functions. Callbacks are functions that can be run during the training process in order to do different things, such as saving the model weights after an epoch, changing hyperparameters, or writing log files. There are some predefined callback functions in keras, but you can also create your own custom callbacks.

Callback for logging metrics on each iteration

Because in a normal training process the keras framework gives us the training metrics (accuracy, loss) only at the end of each epoch, we need to create a function for getting them on each ‘iteration’ or ‘batch’. Remember that one epoch has ‘n’ iterations, and each iteration uses a ‘batch’ of ‘m’ images.

The next function is an R6 class based on the KerasCallback class and will log the accuracy and loss into the LogMetrics object at the end of each batch. This is useful because, by implementing this piece of code, you will have more control over how the cyclic learning rate works on each iteration.
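A sketch of this callback, following the custom-callback pattern from the keras R documentation, could be (note that the accuracy metric is reported as "acc" in older keras versions and "accuracy" in newer ones):

LogMetrics <- R6::R6Class("LogMetrics",
  inherit = KerasCallback,
  public = list(
    loss = NULL,
    acc  = NULL,
    on_batch_end = function(batch, logs = list()) {
      # append the metrics of the batch that has just finished
      self$loss <- c(self$loss, logs[["loss"]])
      self$acc  <- c(self$acc,  logs[["acc"]])
    }
  )
)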

Callbacks for changing the learning rate on each iteration

We also need to define three more callback functions:

  • callback_lr_init: sets the iteration counter to zero and clears the learning rate history lr_hist and the iteration history iter_hist
  • callback_lr_set: sets the learning rate according to the l_rate vector on each iteration
  • callback_lr_log: logs the learning rate value and the iteration number into the lr_hist and iter_hist objects

These functions must be embedded into the callback_lambda() function as you can see below.
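A sketch of these three functions and of how they are wrapped into callback_lambda() could look like this (the global objects iter, lr_hist and iter_hist, as well as the l_rate vector built later, are assumptions about how the pieces fit together; the learning rate is read and written with k_get_value() and k_set_value()):

iter      <- 0     # iteration counter
lr_hist   <- c()   # learning rate history
iter_hist <- c()   # iteration history

callback_lr_init <- function(logs) {
  iter      <<- 0
  lr_hist   <<- c()
  iter_hist <<- c()
}

callback_lr_set <- function(batch, logs) {
  iter <<- iter + 1
  lr <- l_rate[iter]     # l_rate: precomputed vector with one learning rate per iteration
  if (!is.na(lr)) k_set_value(model$optimizer$lr, lr)
}

callback_lr_log <- function(batch, logs) {
  lr_hist   <<- c(lr_hist, k_get_value(model$optimizer$lr))
  iter_hist <<- c(iter_hist, iter)
}

callback_lr     <- callback_lambda(on_train_begin = callback_lr_init,
                                   on_batch_begin = callback_lr_set)
callback_logger <- callback_lambda(on_batch_end   = callback_lr_log)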

Finding the best learning rates boundaries

The learning rate is one of the most important hyperparameters for training a neural network, so it’s very important to know, before starting a full training process, in which range the network converges and in which it diverges. In order to find the best learning rate boundaries you can follow the methodology of this paper, which is very easy to apply and only requires spending a few epochs.

To do that, we will train the model for only five epochs (epochs_find_LR is set to 5 in this example), increasing the learning rate on each iteration until the maximum allowed value, defined by lr_max, is reached.

In the next code, the learning rate will be increased from 0 to lr_max using an exponential function.
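The exact schedule of the original chunk is not shown here, so the values below (lr_max and the growth constant) are only an assumption of how such an exponential ramp could be built:

epochs_find_LR <- 5
lr_max <- 0.1     # assumed ceiling for the range test

n_iter <- ceiling(nrow(x_train) / batch_size) * epochs_find_LR

# learning rate grows exponentially from (almost) 0 up to lr_max
growth_constant <- 15
l_rate <- exp(seq(0, growth_constant, length.out = n_iter))
l_rate <- l_rate / max(l_rate) * lr_max

plot(l_rate, type = "l", xlab = "iteration", ylab = "learning rate")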

[Figure: learning rate increasing exponentially from ~0 to lr_max over the range-test iterations]

Next, we need to initialise the logging callback callback_log_acc and the model. After that, we will train the model for 5 epochs to find the learning rate boundaries.

Notice in the code below how the callback functions are passed using callbacks=list(callback_lr, callback_logger, callback_log_acc). This way we are telling keras to execute the callback functions on each iteration of the training process.
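Put together, the range-test training call could look like the following sketch (make_model() is the helper defined earlier; the verbose setting is an assumption):

callback_log_acc <- LogMetrics$new()
model <- make_model()

history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs     = epochs_find_LR,
  shuffle    = TRUE,
  callbacks  = list(callback_lr, callback_logger, callback_log_acc),
  verbose    = 2
)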

After it has finished, we can plot the accuracy against the learning rate.

[Figure: accuracy versus learning rate during the range test]

For a better understanding, we can smooth the previous curve using a rolling average over 100 iterations and add the learning rate boundaries. Within this range we expect our network to be able to increase its accuracy; as you can see, the network starts to learn at around 8e-6 and stops improving at around 2e-3.

So, adding a safety margin, we can set the learning rate boundaries to Learning_rate_l = 2e-5 (blue line) and Learning_rate_h = 8e-4 (red line).
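A sketch of this smoothing step, using zoo::rollmean for the rolling average, could be:

library(zoo)

Learning_rate_l <- 2e-5   # lower boundary (blue line)
Learning_rate_h <- 8e-4   # upper boundary (red line)

acc_smooth <- rollmean(callback_log_acc$acc, k = 100, fill = NA)

plot(lr_hist, acc_smooth, log = "x", type = "l",
     xlab = "learning rate (log scale)", ylab = "accuracy (rolling mean over 100 iterations)")
abline(v = Learning_rate_l, col = "blue")
abline(v = Learning_rate_h, col = "red")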

[Figure: smoothed accuracy versus learning rate with the selected boundaries]

Training the model: low LR

Next, we will train the model for 75 epochs using a constant learning rate of Learning_rate_l = 2e-5, which corresponds to the lower boundary found before.
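Reusing the callback machinery from above, a sketch of this run could be (callback_log_acc_low is a hypothetical name for a fresh LogMetrics instance; the validation and verbose settings are assumptions):

n_iter <- ceiling(nrow(x_train) / batch_size) * epochs
l_rate <- rep(Learning_rate_l, n_iter)    # constant schedule, one value per iteration

callback_log_acc_low <- LogMetrics$new()
model <- make_model()

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs     = epochs,
  validation_data = list(x_test, y_test),
  callbacks  = list(callback_lr, callback_logger, callback_log_acc_low),
  verbose    = 2
)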

[Figure: accuracy per iteration with the constant low learning rate]

As you can see, the accuracy keeps increasing over the iterations.

[Figures: additional training plots for the constant low learning rate run]

Training the model: high LR

Now we can train the model using the higher learning rate boundary of Learning_rate_h = 8e-4.

[Figure: accuracy per iteration with the constant high learning rate]

As you can see, the accuracy increases quickly and achieves a better performance at the end of the training. This behaviour can be different with other network configurations or hyperparameters. Sometimes, when training in the higher zone of the learning rate boundaries, the model starts increasing the accuracy and then, at a given point, the accuracy drops. In that case it could be a good strategy to stop the training when the accuracy changes its trend.

[Figures: additional training plots for the constant high learning rate run]

Training the model: high LR (with decay)

Now let’s see how the model performs using the high boundary learning rate with a decay value. For that, we can add the argument decay=1e-4 and then retrain the model.
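In the keras R API the decay is an argument of the optimizer, so a sketch of this run recompiles the model with it (optimizer_adam is still an assumption; the original post may use a different optimizer):

model <- make_model()
model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = optimizer_adam(lr = Learning_rate_h, decay = 1e-4),  # learning rate decays on every update
  metrics   = "accuracy"
)

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs     = epochs,
  validation_data = list(x_test, y_test),
  verbose    = 2
)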

[Figure: accuracy per iteration with the high learning rate and decay]

The accuracy never reaches the maximum accuracy achieved without decay. We could try to tune the decay value in order to maximise the final accuracy, but doing so by trial and error is computationally expensive.

[Figure: training results for the high learning rate with decay]

In the next part, we will show how, using cyclic learning rates, we can achieve better results with less computation.

Cyclical Learning Rate function

In order to use a cyclic learning rate we need to define a function called Cyclic_LR, which has been translated from Python to R (see here). This function returns a vector with the learning rate value for each iteration; this output vector will be used by the previously defined callback functions.

Below you can see the output vector of the Cyclic_LR function using the mode='triangular' argument.
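The linked translation is not reproduced here; a compact sketch that behaves in the same way (one learning rate per iteration, following the formulation of the paper) could be the following, where the step_size value of floor(n_iter/75) is the one mentioned later for the exp_range figure and is assumed for the triangular plot as well:

Cyclic_LR <- function(iteration, base_lr = 2e-5, max_lr = 8e-4,
                      step_size = 2000, mode = "triangular", gamma = 1) {
  # iteration: vector 1, 2, ..., n_iter; returns one learning rate per iteration
  cycle <- floor(1 + iteration / (2 * step_size))
  x     <- abs(iteration / step_size - 2 * cycle + 1)
  scale <- switch(mode,
                  triangular  = 1,
                  triangular2 = 1 / (2^(cycle - 1)),
                  exp_range   = gamma^iteration)
  base_lr + (max_lr - base_lr) * pmax(0, 1 - x) * scale
}

n_iter <- ceiling(nrow(x_train) / batch_size) * epochs
l_rate <- Cyclic_LR(iteration = 1:n_iter,
                    base_lr   = Learning_rate_l,
                    max_lr    = Learning_rate_h,
                    step_size = floor(n_iter / 75),
                    mode      = "triangular")

plot(l_rate, type = "l", xlab = "iteration", ylab = "learning rate")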


[Figure: triangular cyclical learning rate over the training iterations]

Training the model: Cyclical Learning Rate

Next, we will repeat the training using the triangular cyclic learning rate shown above.

[Figure: accuracy per iteration with the triangular cyclical learning rate]

In the plot below we add the cyclic accuracy curve to the accuracy curves that we got before. As we can see, the cyclic methodology achieves the highest accuracy, so it can be an interesting tool to apply in the training process of neural networks.

[Figure: accuracy comparison of the constant and cyclical learning rate runs]

Next we will add a decay to the cyclical learning rate in order to see how the network performs.

Training the model: Cyclical Learning Rate (with decay)

Here, we will add a decay value to our triangular cyclical learning rate. A suitable decay can easily be found by plotting the learning rate over all the expected iterations and adjusting it so that the cycles have a smaller amplitude towards the final part of the training.

The next figure is the result of a decay value of gamma=0.99997 using the mode='exp_range' argument. The step_size (number of training iterations per half cycle) will be set to floor(n_iter/75).
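With the Cyclic_LR sketch from above, this schedule could be generated like this:

l_rate <- Cyclic_LR(iteration = 1:n_iter,
                    base_lr   = Learning_rate_l,
                    max_lr    = Learning_rate_h,
                    step_size = floor(n_iter / 75),
                    mode      = "exp_range", gamma = 0.99997)

plot(l_rate, type = "l", xlab = "iteration", ylab = "learning rate")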

[Figure: cyclical learning rate with exponential decay (mode='exp_range', gamma=0.99997)]

[Figure: accuracy per iteration with the decayed cyclical learning rate]

Training the model this way, we obtain the best accuracy of all the alternatives tried before, and this highest accuracy has been achieved with fewer epochs.

[Figure: accuracy comparison including the cyclical learning rate with decay]

Confusion matrix

Below we can check the confusion matrix of the Cyclical Learning Rate (with decay) model on the test dataset.
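A simple way to build it with base R could be the following sketch (class_names comes from the data-loading section; max.col decodes the one-hot rows):

probs  <- model %>% predict(x_test)
y_pred <- max.col(probs)     # predicted class index (1-10)
y_true <- max.col(y_test)    # true class index decoded from the one-hot labels

table(Predicted = class_names[y_pred], Actual = class_names[y_true])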

Conclusion

As far as we have checked using the CIFAR10 dataset, training with Cyclical Learning Rates is a very good technique for training a neural network in an efficient way while also achieving the maximum accuracy (or the minimum loss).

I also showed that it is worth spending some time at the start of the training workflow finding the best learning rate boundaries, in order to save time and computational power during training.

Additional notes: if we execute the code with a batch size of 128, a slightly lower performance is obtained, but the execution time, thanks to the vectorization, is reduced considerably (by around 66%). Even so, according to the tests I have done, there is no remarkable improvement from using the Cyclic Learning Rate for this specific example and dataset with batch size 128. This is surely due to the fact that with batch size 128 the number of iterations is much lower (29,297 iterations), whereas with batch size 32 four times as many were done (117,188 iterations).

In conclusion, it can be said that the Cyclic Learning Rate works better with a large number of iterations. In addition, for any batch size, all the cases demonstrate the usefulness of finding the best learning rate boundaries and of training with the highest usable learning rate.


Session Info:

Appendix, all the code:

