AlexNet
AlexNet, which employed an 8-layer CNN, won the ImageNet Large Scale Visual Recognition Challenge 2012 by a phenomenally large margin. This network showed, for the first time, that the features obtained by learning can transcend manually-designed features, breaking the previous paradigm in computer vision.
Architecture
In AlexNetʼs first layer, the convolution window shape is 11 × 11. Since most images in ImageNet are more than ten times higher and wider than the MNIST images, objects in ImageNet data tend to occupy more pixels. Consequently, a larger convolution window is needed to capture the object. The convolution window shape in the second layer is reduced to 5×5, followed by 3×3. In addition, after the first, second, and fifth convolutional layers, the network adds maximum pooling layers with a window shape of 3×3 and a stride of 2. Moreover, AlexNet has ten times more convolution channels than LeNet. After the last convolutional layer there are two fully-connected layers with 4096 outputs. These two huge fully-connected layers produce model parameters of nearly 1 GB. Due to the limited memory in early GPUs, the original AlexNet used a dual data stream design, so that each of their two GPUs could be responsible for storing and computing only its half of the model. Fortunately, GPU memory is comparatively abundant now, so we rarely need to break up models across GPUs these days (our version of the AlexNet model deviates from the original paper in this aspect).
Activation Functions
Besides, AlexNet changed the sigmoid activation function to a simpler ReLU activation function. On one hand, the computation of the ReLU activation function is simpler. For example, it does not have the exponentiation operation found in the sigmoid activation function. On the other hand, the ReLU activation function makes model training easier when using different parameter initialization methods. This is because, when the output of the sigmoid activation function is very close to 0 or 1, the gradient of these regions is almost 0, so that backpropagation cannot continue to update some of the model parameters. In contrast, the gradient of the ReLU activation function in the positive interval is always 1. Therefore, if the model parameters are not properly initialized, the sigmoid function may obtain a gradient of almost 0 in the positive interval, so that the model cannot be effectively trained
Now time to implement AlexNet using Python and MXNet Framework
Install the required library
!pip install -U mxnet-cu101==1.7.0 # Install if your system has GPU if not then ignore
!pip install mxnet
!pip install d2l
Now import all the required library
from mxnet import np,npx,init
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
net = nn.Sequential()
net.add(
# Here, we use a larger 11 x 11 window to capture objects. At the same time,
# we use a stride of 4 to greatly reduce the height and width of the output.
nn.Conv2D(96,kernel_size=11,strides=4,activation='relu'),
nn.MaxPool2D(pool_size=3,strides=2),
# Make the convolution window smaller, set padding to 2 for consistent
# height and width across the input and output, and increase the
# number of output channels
nn.Conv2D(256,kernel_size=5,padding=2,activation='relu'),
nn.MaxPool2D(pool_size=3,strides=2),
# Use three successive convolutional layers and a smaller convolution
# window. Except for the final convolutional layer, the number of
# output channels is further increased. Pooling layers are not used to
# reduce the height and width of input after the first two
# convolutional layers
nn.Conv2D(384, kernel_size=3,padding=1,activation='relu'),
nn.Conv2D(384, kernel_size=3,padding=1,activation='relu'),
nn.Conv2D(256, kernel_size=3,padding=1,activation='relu'),
nn.MaxPool2D(pool_size=3, strides=2),
nn.Dense(4096,activation='relu'),
nn.Dropout(.5),
nn.Dense(4096,activation='relu'),
nn.Dropout(0.5),
nn.Dense(10)
)
X = np.random.uniform(size=(1,1,224,224))
net.initialize()
for layer in net:
X=layer(X)
print(layer.name,'output shape:\t',X.shape)
conv2 output shape: (1, 96, 54, 54)
pool1 output shape: (1, 96, 26, 26)
conv3 output shape: (1, 256, 26, 26)
pool2 output shape: (1, 256, 12, 12)
conv4 output shape: (1, 384, 12, 12)
conv5 output shape: (1, 384, 12, 12)
conv6 output shape: (1, 256, 12, 12)
pool3 output shape: (1, 256, 5, 5)
dense0 output shape: (1, 4096)
dropout0 output shape: (1, 4096)
dense1 output shape: (1, 4096)
dropout1 output shape: (1, 4096)
dense2 output shape: (1, 10)
batch_size = 128
train_iter,test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
lr, num_epochs = 0.01,10
d2l.train_ch6(net,train_iter,test_iter,num_epochs,lr,d2l.try_gpu())
loss 0.331, train acc 0.879, test acc 0.890
1494.6 examples/sec on gpu(0)
# Data for prediction
for X,y in test_iter:
X=X
y=y
break
def get_fashion_mnist_labels(labels):
text_labels=['t-shirt','trouser','pullover','dress','coat','sandal','shirt','sneaker','bag','ankle boot']
return text_labels[labels]
import pandas as pd
def predict(X,y):
trl=[]
prel=[]
count = 0
X=X.as_in_ctx(d2l.try_gpu())
z=net(X).argmax(axis=1)
n=len(z)
for i,j in zip(z,y):
pre=get_fashion_mnist_labels(int(i))
tru = get_fashion_mnist_labels(int(j))
if pre==tru:
count+=1
trl.append(tru)
prel.append(pre)
label = {'true_label':trl,'predicted_label':prel}
df = pd.DataFrame(label)
print(df)
print('The accuracy:',(count/n)*100)
predict(X[:10],y)
true_label predicted_label
0 t-shirt t-shirt
1 trouser trouser
2 pullover pullover
3 pullover shirt
4 dress dress
5 pullover shirt
6 bag bag
7 shirt pullover
8 sandal sandal
9 t-shirt t-shirt
The accuracy: 70.0 `Refrences`
Leave a Comment