AlexNet: ImageNet Classification with Deep Convolutional Neural Networks — 2012

Jinpeng Zhang
Dec 31, 2024


This work was done by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton in 2012. AlexNet won the 2012 ImageNet Challenge and dramatically improved the top-5 error rate, from 26.2% to 15.3%. In my opinion, the most significant contribution of AlexNet is not that it improved ImageNet classification by 10.9 percentage points, but that it showed the great potential of deep neural networks on large-scale datasets and revived interest in deep learning. After AlexNet, a great deal of research effort turned toward deep neural networks, producing ZFNet, GoogLeNet, ResNet, and of course the Transformer in 2017. So I think AlexNet is a great milestone in the history of Artificial Intelligence.

In this blog, I will introduce what AlexNet did that was special. I basically follow the structure of the paper: the architecture of AlexNet, the ReLU nonlinearity it uses as the activation function, training on multiple GPUs, Local Response Normalization, Overlapping Pooling, and the methods used to reduce overfitting, namely Data Augmentation and Dropout.

The Architecture of AlexNet

Architecture of AlexNet

AlexNet is a deep convolutional neural network with 5 convolutional layers and 3 fully-connected layers. It has 60 million parameters and 650,000 neurons. Compared to today's LLMs such as the 8B and 405B Llama models, AlexNet is very small, but back in 2012 it was one of the largest neural networks in terms of parameter count and computational requirements.

In the diagram above, the leftmost block is the input image, followed by the 5 convolutional layers and then the 3 fully-connected layers. AlexNet takes fixed-size 224×224×3 (RGB) input images, so the authors down-sampled the ImageNet images to 256×256 (rescaling so the shorter side is 256, then cropping out the central 256×256 patch) before feeding 224×224 crops to the network. The output of the last fully-connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels.

From the diagram you may also notice that three layers are labeled "Max pooling": they follow the 1st, 2nd and 5th convolutional layers, which is why the spatial size of the feature maps shrinks at those points. Max-pooling layers downsample the spatial dimensions of the feature maps, retaining the most salient activations while reducing computational complexity.
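To make the layer structure concrete, here is a minimal single-GPU sketch in PyTorch. The kernel counts and sizes follow the paper, but the padding values, the class name, and collapsing the two-GPU split onto one device are my own simplifications (torchvision ships a slightly different variant):

```python
import torch
import torch.nn as nn

# Minimal single-GPU sketch of the AlexNet layer structure.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # conv1
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2),            # conv2
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),           # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096),                            # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                   # fc7
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes),                            # fc8 -> 1000-way softmax
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNetSketch()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```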

ReLU (Rectified Linear Units)

The most commonly used activation functions at the time were the sigmoid and tanh (hyperbolic tangent), but these saturating nonlinearities make training with gradient descent much slower on large models and datasets. AlexNet used the non-saturating ReLU, f(x) = max(0, x), to accelerate training, which is part of what made training such a large model on a large dataset practical. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

Output of 3 different activation functions
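As a quick sanity check of why ReLU trains faster, here is a tiny sketch (my own illustration, not from the paper) showing how sigmoid and tanh saturate at the tails, where their gradients vanish, while ReLU stays linear for positive inputs:

```python
import torch

x = torch.linspace(-4.0, 4.0, steps=9)
print(torch.sigmoid(x))  # squashes into (0, 1); nearly flat at the tails, so gradients vanish
print(torch.tanh(x))     # squashes into (-1, 1); also saturates at the tails
print(torch.relu(x))     # max(0, x): non-saturating for positive inputs, cheap to compute
```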

Training on Multiple GPUs

AlexNet was trained on two GTX 580 GPUs, each with 3 GB of memory. These GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another's memory directly, without going through host machine memory.

The parallelization scheme the paper employed essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. By fitting roughly half of the net into each GPU, the two-GPU network reduced the top-5 error rate by 1.2% compared with a half-sized network trained on a single GPU.
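The original two-GPU split is hard to reproduce without that hardware, but the restricted connectivity can be mimicked on a single device with grouped convolutions; this is a sketch of the idea, not the authors' implementation. With groups=2 each half of the output kernels sees only its own half of the input channels (a layer where the GPUs do not communicate), while groups=1 lets every kernel see all input channels (a layer where they do):

```python
import torch.nn as nn

# groups=2: each half of the kernels reads only its own half of the inputs,
#           like the layers where the two GPUs do not communicate.
# groups=1: every kernel reads all input channels, like conv3, where the GPUs
#           exchange their feature maps.
conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)
conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1, groups=1)
conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2)
conv5 = nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2)
```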

LRN (Local Response Normalization)

The purpose of LRN is to aid generalization and reduce overfitting in the trained net. It implements a form of lateral inhibition inspired by real neurons, where active neurons suppress their neighbors to sharpen responses and enhance contrast. LRN is applied to the ReLU outputs of the first and second convolutional layers only. Please refer to the original paper for more detail on how a neuron competes with its neighbors; the following formula shows the mathematics of LRN. Applying LRN reduced the top-5 error rate by another 1.2%.

LRN Formula
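For reference, the formula shown above can be written out as follows, where a^i_{x,y} is the ReLU output of kernel i at position (x, y), N is the number of kernels in the layer, and the sum runs over n neighboring kernel maps; the paper uses k = 2, n = 5, α = 10⁻⁴ and β = 0.75:

b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}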

Overlapping Pooling

“Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap. To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at the location of the pooling unit. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling. This is what we use throughout our network, with s = 2 and z = 3.”

This overlapping pooling reduced the top-1 and top-5 error rates by 0.4% and 0.3% respectively, compared with the non-overlapping scheme (s = 2, z = 2). The authors also observed that models with overlapping pooling are slightly more difficult to overfit during training. I think this is because each pooled value is smoothed a little by the overlapping edge inputs, which makes the summaries generalize better.
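A small sketch of the size arithmetic (the tensor shape is my own example, using the conv1 output size from the paper): with z = 3, s = 2 each pooling unit looks at a 3×3 window, so neighboring windows share a row or column of inputs, yet in this case the output size is the same as with non-overlapping z = 2, s = 2 pooling:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                    # e.g. the conv1 feature maps
overlap = nn.MaxPool2d(kernel_size=3, stride=2)   # z=3, s=2: overlapping windows
plain = nn.MaxPool2d(kernel_size=2, stride=2)     # z=2, s=2: traditional pooling
print(overlap(x).shape)                           # torch.Size([1, 96, 27, 27])
print(plain(x).shape)                             # torch.Size([1, 96, 27, 27])
```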

Data Augmentation

Reducing overfitting is important during training. Overfitting means the model performs well on the training dataset but badly on unseen data; in other words, it lacks generalization and is not useful. In this paper, the authors applied data augmentation and dropout to reduce overfitting during training.

For data augmentation, AlexNet applied two different forms. The first consists of generating image translations and horizontal reflections: from each 256×256 training image, random 224×224 patches are extracted (a random sub-region of the larger image), and each extracted patch is also mirrored horizontally, which doubles the number of patches per image. This scheme feeds the model varied input patterns and enlarges the training set by a factor of 2048, which prevents the large network from simply memorizing (overfitting) the relatively small dataset of 1.2M training images.
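In modern code the first form can be approximated with torchvision transforms; this is just a sketch of the idea, not the original GPU-side implementation (the 32 × 32 crop positions times 2 reflections give the factor of 2048 the paper mentions):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(224),              # random 224x224 patch from the 256x256 image
    T.RandomHorizontalFlip(p=0.5),  # and its horizontal reflection
    T.ToTensor(),
])
```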

The second form of data augmentation consists of altering the intensities of the RGB channels in training images to simulate variations in lighting conditions. While the intensity and color of the illumination vary, the object in the image remains the same, so by exposing the model to diverse lighting conditions during training, it learns to focus on features that are invariant to such changes. This scheme reduces the top-1 error rate by over 1%.
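The paper implements this as PCA-based color augmentation: it runs PCA on the RGB values of all training pixels, then adds to every pixel of an image the principal components scaled by their eigenvalues and by a per-image Gaussian draw with standard deviation 0.1. Here is a minimal NumPy sketch, assuming the eigenvectors and eigenvalues have already been computed from the training set (the function name and arguments are mine):

```python
import numpy as np

def fancy_pca(image, eigvecs, eigvals, std=0.1):
    """Add PCA-based lighting noise to an HxWx3 float image in [0, 1].

    eigvecs (3x3, columns = principal components) and eigvals (3,) come from
    the covariance of RGB values over the whole training set, as in the paper.
    """
    rng = np.random.default_rng()
    alpha = rng.normal(0.0, std, size=3)      # one draw of (a1, a2, a3) per image
    shift = eigvecs @ (alpha * eigvals)       # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return np.clip(image + shift, 0.0, 1.0)   # same RGB shift added to every pixel
```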

Dropout

“Combining the predictions of many different models is a very successful way to reduce test errors, but it appears to be too expensive for big neural networks that already take several days to train. There is, however, a very efficient version of model combination that only costs about a factor of two during training. The recently-introduced technique, called “dropout”, consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons”.

This paragraph is quite easy to understand. Think of the 2009 Netflix Prize: the winning team used an ensemble that combined over 100 models to make the final prediction, because the common "knowledge and patterns" learned by models with different structures do not depend on any particular structure, which means they generalize. Dropout was applied to the first two fully-connected layers; without it, the network exhibited substantial overfitting. The trade-off is that dropout roughly doubles the number of iterations required to converge.
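In a framework like PyTorch this is one line per layer. Note that PyTorch uses "inverted" dropout, scaling the surviving activations by 1/0.5 at training time, whereas the paper instead multiplies the outputs by 0.5 at test time; the effect is equivalent. A tiny sketch of the behavior:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # applied to the outputs of the first two fully-connected layers
h = torch.ones(8)

drop.train()
print(drop(h))   # roughly half the activations are zeroed; survivors are scaled by 2
drop.eval()
print(drop(h))   # at test time every neuron is kept and nothing is scaled
```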

Some Details of Learning

The authors trained the model using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The model processed 128 random training samples at a time to compute gradients and update the weights. The small weight decay adds a penalty term to the loss that discourages large weights, which aids the model's generalization.
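Expressed with PyTorch's optimizer, and reusing the model object from the architecture sketch above, the configuration would look roughly like this (PyTorch's SGD update differs slightly in form from the paper's rule, and the paper also divides the learning rate by 10 whenever the validation error stops improving):

```python
import torch

optimizer = torch.optim.SGD(
    model.parameters(),   # the AlexNetSketch instance from earlier
    lr=0.01,              # the paper's initial learning rate
    momentum=0.9,
    weight_decay=5e-4,
)
```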

Qualitative Evaluations

In the last part of the paper, the authors did some qualitative evaluations of the model. For example, the following diagram shows the kernels learned by the first convolutional layer: the top 48 were learned on GPU 1 and the bottom 48 on GPU 2. The top 48 captured color-agnostic patterns such as shapes, edges and textures, while the bottom 48 captured color-specific, chromatic patterns. This specialization between the two GPUs is a consequence of the restricted connectivity in certain layers.

The left part of the following diagram shows the net's top-5 predictions on eight test images. A few things stand out: off-center objects are recognized well (as in the first image), which shows the net is robust; and most of the top-5 labels appear reasonable (for the leopard in the fourth image, only other types of cat are considered plausible labels), which shows the net's predictions make sense semantically.

The right part of the following diagram shows that dogs and elephants photographed in different poses are still retrieved as similar, because their representations in the net's last hidden layer are close even though their pixel-level similarity is very low. This means the net captured the semantic features of these images.
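The retrieval is done by comparing the 4096-dimensional activations of the last hidden layer with Euclidean distance, rather than comparing raw pixels. A sketch with made-up feature vectors (the variable names are mine):

```python
import torch

features_4096 = torch.randn(100, 4096)   # stand-in for last-hidden-layer activations of 100 images
query = features_4096[0].unsqueeze(0)    # the query image's feature vector
dists = torch.cdist(query, features_4096).squeeze(0)
nearest = torch.topk(dists, k=6, largest=False).indices
print(nearest)                           # index 0 is the query itself; the rest are its neighbors
```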

Written by Jinpeng Zhang

Director of Engineering @ TiDB, focused on building large-scale distributed systems and high-performance engineering teams.
