Exercise: Machine Learning Optimizers

Gradient descent is one of the most popular optimization algorithms and one of the most common ways to train neural networks. Several more refined variants exist, which we will investigate in this exercise. Six common optimizers are:

SGD Stochastic gradient descent optimizes the parameters of the network by randomly choosing a mini-batch from the entire dataset and calculating the gradient of the loss function \(J(\theta)\) with respect to these data points. The learning rate \(\eta\) specifies how large a step we take in the direction opposite to the gradient.

\[\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta).\]
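As a minimal sketch of this update rule, the NumPy snippet below runs mini-batch SGD on a small linear regression problem. The synthetic data, the helper `grad_mse`, the batch size of 32, and the learning rate are all assumptions chosen for illustration and are not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free regression data (an assumption made for this sketch)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true

def grad_mse(theta, X_batch, y_batch):
    """Gradient of J(theta) = mean((X theta - y)^2) / 2 on one mini-batch."""
    residual = X_batch @ theta - y_batch
    return X_batch.T @ residual / len(y_batch)

theta = np.zeros(2)
eta = 0.1  # learning rate
for step in range(500):
    # Randomly choose a mini-batch from the entire dataset
    idx = rng.choice(len(y), size=32, replace=False)
    # theta <- theta - eta * grad J(theta)
    theta = theta - eta * grad_mse(theta, X[idx], y[idx])

print(theta)  # should end up close to w_true
```

Because the data contain no noise, every mini-batch gradient vanishes at `w_true`, so the iterates settle there instead of jittering around it.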

Momentum The momentum optimizer refines the optimization by accumulating previous gradients in a momentum vector \(\mathbf{m}\). To prevent the momentum from growing without bound, one adds a friction parameter \(\beta\) between 0 (high friction) and 1 (no friction), which has to be chosen appropriately.

\[\mathbf{m}\leftarrow \beta \mathbf{m} + \eta \nabla_{\theta} J(\theta),\]

\[\theta \leftarrow \theta - \mathbf{m}.\]
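The two momentum equations can be sketched in a few lines of NumPy. The function name `momentum_step`, the toy loss \(J(\theta) = \tfrac{1}{2}\|\theta\|^2\), and the values of \(\eta\) and \(\beta\) are assumptions made for this example.

```python
import numpy as np

def momentum_step(theta, m, grad, eta=0.1, beta=0.9):
    """One momentum update; returns the new parameters and momentum vector."""
    m = beta * m + eta * grad   # m <- beta * m + eta * grad J(theta)
    theta = theta - m           # theta <- theta - m
    return theta, m

# Toy loss J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
for _ in range(200):
    theta, m = momentum_step(theta, m, grad=theta)

print(theta)  # close to the minimum at the origin
```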

NAG A small modification to momentum optimization is Nesterov accelerated gradient (NAG), which evaluates the gradient not at the current parameters \(\theta\) but slightly ahead in the direction of the momentum, at \(\theta - \beta \mathbf{m}\). Since the momentum vector generally points towards the optimum, the gradient measured there is slightly more accurate.

\[\mathbf{m}\leftarrow \beta \mathbf{m} + \eta \nabla_{\theta} J(\theta - \beta \mathbf{m}),\]

\[\theta \leftarrow \theta - \mathbf{m}.\]
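A minimal NumPy sketch of NAG follows. With the sign convention \(\theta \leftarrow \theta - \mathbf{m}\) used here, the look-ahead point is \(\theta - \beta \mathbf{m}\), the position the momentum alone would carry the parameters to. The names `nag_step` and `grad_fn` and the toy quadratic loss are assumptions for illustration.

```python
import numpy as np

def nag_step(theta, m, grad_fn, eta=0.1, beta=0.9):
    """One Nesterov update; grad_fn maps parameters to the gradient of J."""
    lookahead = theta - beta * m              # where the momentum alone would take us
    m = beta * m + eta * grad_fn(lookahead)   # gradient evaluated at the look-ahead point
    theta = theta - m
    return theta, m

def grad_fn(theta):
    # Gradient of the toy loss J(theta) = 0.5 * ||theta||^2
    return theta

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
for _ in range(200):
    theta, m = nag_step(theta, m, grad_fn)

print(theta)  # close to the minimum at the origin
```

Unlike plain momentum, the step needs a gradient function rather than a precomputed gradient, because the evaluation point depends on \(\mathbf{m}\).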

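AdaGrad The AdaGrad algorithm points the update more directly towards the optimum by scaling down the gradient vector along the steepest dimensions. The first step is to accumulate the element-wise square of the gradients in a vector \(\mathbf{s}\); the second is to divide the gradient element-wise by the square root of \(\mathbf{s}\). Here \(\otimes\) and \(\oslash\) denote element-wise multiplication and division. In order to prevent division by zero we add a small parameter \(\epsilon\).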

\[\mathbf{s}\leftarrow \mathbf{s} + \nabla_{\theta} J(\theta) \otimes \nabla_{\theta} J(\theta),\]

\[\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \oslash\sqrt{\mathbf{s} + \epsilon}.\]
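The effect of the scaling can be seen in a small NumPy sketch. The toy loss below is 100 times steeper along the first coordinate than along the second, yet the first AdaGrad step moves both coordinates by the same amount. The function names, the loss, and the hyperparameter values are assumptions for this example.

```python
import numpy as np

def adagrad_step(theta, s, grad, eta=0.5, eps=1e-10):
    """One AdaGrad update; s accumulates the element-wise squared gradients."""
    s = s + grad * grad                            # s <- s + grad (x) grad
    theta = theta - eta * grad / np.sqrt(s + eps)  # element-wise scaling
    return theta, s

def grad_fn(theta):
    # Gradient of the toy loss J = 0.5 * (100 * theta_0^2 + theta_1^2)
    return np.array([100.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)

theta, s = adagrad_step(theta, s, grad_fn(theta))
first_step = theta.copy()
print(first_step)  # both coordinates moved by eta, despite very different gradients

for _ in range(49):
    theta, s = adagrad_step(theta, s, grad_fn(theta))
print(theta)
```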

RMSProp RMSProp is a slight modification of AdaGrad that prevents the training from slowing down too fast. It accumulates only the gradients from the most recent iterations by applying an exponential decay with rate \(\beta\) to \(\mathbf{s}\).

\[\mathbf{s}\leftarrow \beta \mathbf{s} + \left(1-\beta\right)\nabla_{\theta} J(\theta) \otimes \nabla_{\theta} J(\theta),\]

\[\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon}.\]
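A short NumPy sketch of the RMSProp update is given below. The name `rmsprop_step`, the one-dimensional toy loss \(J(\theta) = \tfrac{1}{2}\theta^2\), and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def rmsprop_step(theta, s, grad, eta=0.01, beta=0.9, eps=1e-10):
    """One RMSProp update; s is an exponentially decaying average of grad^2."""
    s = beta * s + (1.0 - beta) * grad * grad
    theta = theta - eta * grad / np.sqrt(s + eps)
    return theta, s

# Toy loss J(theta) = 0.5 * theta^2, whose gradient is theta itself
theta = np.array([1.0])
s = np.zeros_like(theta)
for _ in range(500):
    theta, s = rmsprop_step(theta, s, grad=theta)

print(theta)  # hovers near the minimum at 0
```

Because old squared gradients decay away, the effective step size does not shrink towards zero as it does in AdaGrad. With a fixed \(\eta\), the iterates therefore keep moving in small steps around the minimum rather than converging exactly.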

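Adam The adaptive moment estimation (Adam) combines the ideas of momentum and RMSProp optimization and is one of the most common choices today. \(T\) is the iteration number (the count of update steps, starting at 1), not the epoch number. It enters the bias-correction factors, which compensate for \(\mathbf{m}\) and \(\mathbf{s}\) being initialized at zero.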

\[\mathbf{m}\leftarrow \beta_1 \mathbf{m} + \left(1-\beta_1\right) \nabla_{\theta} J(\theta),\]

\[\mathbf{s}\leftarrow \beta_2 \mathbf{s} + \left(1-\beta_2\right)\nabla_{\theta} J(\theta) \otimes \nabla_{\theta} J(\theta),\]

\[\hat{\mathbf{m}} \leftarrow \frac{\mathbf{m}}{1-\beta_1^T},\]

\[\hat{\mathbf{s}} \leftarrow \frac{\mathbf{s}}{1-\beta_2^T},\]

\[\theta \leftarrow \theta - \eta \hat{\mathbf{m}}\oslash \sqrt{\hat{\mathbf{s}} + \epsilon}.\]
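The Adam update can be sketched in NumPy as follows. The bias-corrected estimates are kept in separate local variables (`m_hat`, `s_hat`) so that the running averages `m` and `s` carried to the next iteration are not overwritten. All names, the toy loss \(J(\theta) = \tfrac{1}{2}\|\theta\|^2\), and the hyperparameter values are assumptions for this example.

```python
import numpy as np

def adam_step(theta, m, s, grad, t, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-10):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # momentum-like first moment
    s = beta2 * s + (1.0 - beta2) * grad * grad   # RMSProp-like second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for zero init
    s_hat = s / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / np.sqrt(s_hat + eps)
    return theta, m, s

theta0 = np.array([1.0, -2.0])
theta = theta0.copy()
m = np.zeros_like(theta)
s = np.zeros_like(theta)
for t in range(1, 201):
    # Gradient of the toy loss is theta itself
    theta, m, s = adam_step(theta, m, s, grad=theta, t=t)
    if t == 1:
        first_step = theta.copy()

print(first_step)  # each coordinate moved by about eta towards the origin
print(theta)
```

At \(T = 1\) the corrections give \(\hat{\mathbf{m}}\) equal to the gradient and \(\hat{\mathbf{s}}\) equal to its element-wise square, so the very first step has magnitude of about \(\eta\) in every coordinate regardless of the gradient scale.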

In order to use the optimizers already implemented in the Keras API of TensorFlow, we use a function which generates and compiles a model for a given optimizer.

from tensorflow import keras

def build_compile(optimizer='SGD'):
    """Build and compile the classifier for the given optimizer (name or instance)."""
    # Use the same network topology as last week
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(10, activation='softmax'),
    ])

    # Compile the model with a cross-entropy loss and the given optimizer
    model.compile(optimizer=optimizer,
                  loss=keras.losses.SparseCategoricalCrossentropy(),
                  metrics=['accuracy'])
    return model

Now we generate a list of the different optimizers to iterate over in a for loop.

optimizer_names = ['SGD', 'Momentum', 'Nesterov', 'RMSprop', 'Adagrad', 'Adam', 'NAdam']
optimizer_list = [
    'SGD',
    keras.optimizers.SGD(learning_rate=0.01, momentum=0.5, nesterov=False),
    keras.optimizers.SGD(learning_rate=0.01, momentum=0.5, nesterov=True),
    'RMSprop',
    'Adagrad',
    'Adam',
    'NAdam',
]

# Two arrays for training and validation performance
hist_acc = []
hist_val_acc = []

# Iterate over optimizers and train the network, using x_test and y_test as a validation set in each epoch
for item,name in zip(optimizer_list, optimizer_names):
    print("-----------------------------")
    print("Doing %s optimizer" %str(name))
    print("-----------------------------")
    
    # Get the model from our function above
    model = build_compile(item)
    
    # Train the model
    history = model.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_test, y_test))
    
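    # Store the performance. The history keys are 'acc'/'val_acc' in the
    # TensorFlow 1.x run shown below and 'accuracy'/'val_accuracy' in TensorFlow 2.x.
    acc_key = 'acc' if 'acc' in history.history else 'accuracy'
    hist_acc.append(history.history[acc_key])
    hist_val_acc.append(history.history['val_' + acc_key])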
    print("-----------------------------")

-----------------------------
Doing SGD optimizer
-----------------------------
Train on 60000 samples, validate on 10000 samples
Epoch 1/50
60000/60000 [==============================] - 2s 32us/sample - loss: 0.6429 - acc: 0.8389 - val_loss: 0.3526 - val_acc: 0.9054

Epoch 50/50
60000/60000 [==============================] - 2s 26us/sample - loss: 0.0472 - acc: 0.9881 - val_loss: 0.0780 - val_acc: 0.9768
-----------------------------
-----------------------------
Doing Momentum optimizer
-----------------------------
Train on 60000 samples, validate on 10000 samples
Epoch 1/50
60000/60000 [==============================] - 2s 30us/sample - loss: 0.5009 - acc: 0.8676 - val_loss: 0.2948 - val_acc: 0.9187

Epoch 50/50
60000/60000 [==============================] - 2s 31us/sample - loss: 0.0220 - acc: 0.9960 - val_loss: 0.0685 - val_acc: 0.9787
-----------------------------
-----------------------------
Doing Nesterov optimizer
-----------------------------
Train on 60000 samples, validate on 10000 samples
Epoch 1/50
60000/60000 [==============================] - 2s 30us/sample - loss: 0.5109 - acc: 0.8639 - val_loss: 0.3035 - val_acc: 0.9158

Epoch 50/50
60000/60000 [==============================] - 2s 27us/sample - loss: 0.0213 - acc: 0.9963 - val_loss: 0.0685 - val_acc: 0.9779
-----------------------------
-----------------------------
Doing RMSprop optimizer
-----------------------------
Train on 60000 samples, validate on 10000 samples
Epoch 1/50
60000/60000 [==============================] - 2s 37us/sample - loss: 0.2563 - acc: 0.9273 - val_loss: 0.1384 - val_acc: 0.9587

Epoch 50/50
60000/60000 [==============================] - 2s 34us/sample - loss: 4.0403e-04 - acc: 0.9998 - val_loss: 0.2369 - val_acc: 0.9764
-----------------------------
-----------------------------
Doing Adagrad optimizer
-----------------------------
Train on 60000 samples, validate on 10000 samples

Epoch 1/50
60000/60000 [==============================] - 2s 34us/sample - loss: 0.6485 - acc: 0.8455 - val_loss: 0.4352 - val_acc: 0.8932

Epoch 50/50
60000/60000 [==============================] - 2s 32us/sample - loss: 0.2028 - acc: 0.9443 - val_loss: 0.2022 - val_acc: 0.9420
-----------------------------
-----------------------------
Doing Adam optimizer
-----------------------------
Train on 60000 samples, validate on 10000 samples
Epoch 1/50
60000/60000 [==============================] - 2s 30us/sample - loss: 0.2568 - acc: 0.9266 - val_loss: 0.1312 - val_acc: 0.9590

Epoch 50/50
60000/60000 [==============================] - 2s 30us/sample - loss: 0.0027 - acc: 0.9990 - val_loss: 0.1513 - val_acc: 0.9800
-----------------------------
-----------------------------
Doing NAdam optimizer
-----------------------------
Train on 60000 samples, validate on 10000 samples
Epoch 1/50
60000/60000 [==============================] - 3s 45us/sample - loss: 0.2584 - acc: 0.9272 - val_loss: 0.1445 - val_acc: 0.9564

Epoch 50/50
60000/60000 [==============================] - 2s 40us/sample - loss: 0.0027 - acc: 0.9992 - val_loss: 0.1757 - val_acc: 0.9786
-----------------------------

import matplotlib.pyplot as plt

# Summarize the accuracy history on the training set
for i in range(len(optimizer_list)):
    plt.plot(hist_acc[i],'-o',label=str(optimizer_names[i]))
plt.title('model accuracy on train')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(loc='upper left')
plt.show()

Figure: Training accuracy versus epoch for each optimizer.

# summarize history for accuracy on test set
for i in range(len(optimizer_list)):
    plt.plot(hist_val_acc[i],'-o', label=str(optimizer_names[i]))
plt.title('model accuracy on test')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(loc='upper left')
plt.show()

Figure: Test (validation) accuracy versus epoch for each optimizer.

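As already discussed in class, the Adam optimizer shows the best performance here: it reaches the highest final validation accuracy (0.9800), as it combines a momentum-based gradient approach with an adaptive learning rate. NAdam is a variant of Adam that uses the Nesterov update instead of vanilla momentum optimization; in this run it ends at a similar validation accuracy (0.9786).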
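To compare the optimizers at a glance, the final-epoch accuracies printed in the training log above can be ranked with a few lines of Python. The numbers below are copied from that log (epoch 50 of each run). The dictionary `final_acc` and the derived names are hypothetical helpers for this sketch; with the lists from the training loop one could instead take the last entry of each history in `hist_acc` and `hist_val_acc`.

```python
# (training accuracy, validation accuracy) at epoch 50, taken from the log above
final_acc = {
    "SGD":      (0.9881, 0.9768),
    "Momentum": (0.9960, 0.9787),
    "Nesterov": (0.9963, 0.9779),
    "RMSprop":  (0.9998, 0.9764),
    "Adagrad":  (0.9443, 0.9420),
    "Adam":     (0.9990, 0.9800),
    "NAdam":    (0.9992, 0.9786),
}

# Rank the optimizers by validation accuracy, best first
ranking = sorted(final_acc, key=lambda name: final_acc[name][1], reverse=True)

# Gap between training and validation accuracy as a rough overfitting indicator
gaps = {name: train - val for name, (train, val) in final_acc.items()}

for name in ranking:
    train, val = final_acc[name]
    print(f"{name:9s} train={train:.4f}  val={val:.4f}  gap={gaps[name]:.4f}")
```

The differences between the top entries are a few tenths of a percent from a single run each, so the ranking should be read as indicative rather than conclusive.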