Build a Convolutional Neural Network for Beginners
In this article, I will discuss some basic concepts of a convolutional neural network and how to properly tune a model.
What is a Convolutional Neural Network and how does it work?
A Convolutional Neural Network, also called a CNN, is a type of deep learning model frequently applied to image classification. A CNN is constructed from three types of repeating building blocks: convolutional layers, pooling layers, and fully connected layers.
The convolutional layer, which performs feature extraction, is the fundamental component of the CNN architecture. This layer applies a specialized linear operation: a filter (also called a kernel) slides across the image, and at each position the overlapping values are multiplied element-wise and summed. The filter size is usually an odd number (e.g., 3×3 or 5×5). The resulting matrix is called a feature map. This procedure is repeated with multiple filters to produce an arbitrary number of feature maps.
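To make the sliding-filter idea concrete, here is a minimal NumPy sketch of a single convolution (strictly, the cross-correlation that CNN layers actually compute); the 5×5 image and the vertical-edge kernel are just illustrative values:

```python
import numpy as np

# A toy 5x5 "image" and a 3x3 filter (note the odd kernel size)
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])  # a simple vertical-edge filter

# Slide the filter across the image (no padding, stride 1)
kh, kw = kernel.shape
out_h = image.shape[0] - kh + 1
out_w = image.shape[1] - kw + 1
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum

print(feature_map)  # a 3x3 feature map
```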
After a convolutional layer, the feature maps are fed into a pooling layer. The pooling layer reduces dimensionality by performing a downsampling operation. There are two common types of pooling: max pooling and average pooling. Max pooling (the most popular) extracts the maximum value from each patch (determined by the pool size) and discards the remaining values; average pooling takes the average value of each patch. The resulting feature maps, with reduced dimensionality, are passed to the next convolutional layer for further pattern finding.
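As a quick worked example, here is 2×2 max pooling (stride 2) over a small feature map; swapping `patch.max()` for `patch.mean()` would give average pooling instead:

```python
import numpy as np

feature_map = np.array([[1., 3., 2., 4.],
                        [5., 6., 1., 2.],
                        [7., 2., 8., 3.],
                        [1., 4., 2., 9.]])

pool = 2  # 2x2 patches, stride 2
pooled = np.zeros((feature_map.shape[0] // pool, feature_map.shape[1] // pool))
for i in range(0, feature_map.shape[0], pool):
    for j in range(0, feature_map.shape[1], pool):
        patch = feature_map[i:i + pool, j:j + pool]
        pooled[i // pool, j // pool] = patch.max()  # keep only the largest value per patch

print(pooled)  # [[6. 4.]
               #  [7. 9.]]
```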
The final reduced feature maps are passed through a flatten layer, which transforms the 2D matrices into a 1D vector. This vector then passes through one (or more) fully connected layers (also known as dense layers) to make predictions about the images.
Here is an example of a CNN with three convolutional blocks.
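Below is a minimal Keras sketch of such a model; the input shape, filter counts, and the 10-class softmax output are illustrative assumptions, not values from any specific dataset:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Block 1: convolution (feature extraction) + max pooling (downsampling)
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    # Block 2
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Block 3
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Flatten the 2D feature maps into a 1D vector, then classify
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),  # assuming 10 classes
])
model.summary()
```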
Tuning CNN Arguments
Activation Function
The activation function (also known as the transfer function) transforms a neuron's (also called a node's) input into its output, introducing the non-linearity that helps the network focus on the more important information.
ReLU (Rectified Linear Unit)
Pros
- The most commonly used activation function in CNNs
- Positive inputs pass through unchanged; negative inputs output 0
- Computationally efficient
- Does not suffer from the vanishing gradient problem (unlike sigmoid or tanh)
Cons
- “Dying ReLU”: if too many outputs become 0, those neurons keep outputting 0 and stop learning, which can prevent the model from learning
Leaky ReLU
Pros
- Similar to ReLU, but does not suffer from the “dying ReLU” problem, because negative inputs produce a small output (αx) instead of 0
Cons
- The slope α needs to be defined prior to training
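To make the difference concrete, here is a minimal NumPy sketch of both functions (α = 0.01 is just a common illustrative default):

```python
import numpy as np

def relu(x):
    # Positive inputs pass through; negative inputs become 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Negative inputs get a small slope (alpha * x) instead of a hard 0;
    # alpha must be chosen before training
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]
```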
The activation function for the last fully connected layer is determined by the type of classification.
- For a binary classification problem, “sigmoid” will be used.
- For a multiclass classification problem, “softmax” will be used.
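In Keras, this choice shows up in the final dense layer; a sketch of the two alternatives (one output unit vs. one unit per class, with 10 classes assumed for illustration):

```python
from tensorflow.keras import layers

# Binary classification: a single unit squashed to a probability
binary_head = layers.Dense(1, activation="sigmoid")

# Multiclass classification: one unit per class, probabilities summing to 1
multiclass_head = layers.Dense(10, activation="softmax")  # assuming 10 classes
```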
Loss Function
The loss function (also known as the cost function) evaluates the performance of the model by measuring the difference between the true labels and the predicted labels.
The loss function at the compile step is determined by the type of classification.
- For a binary classification problem, “binary_crossentropy” will be used.
- For a multiclass classification problem, “categorical_crossentropy” will be used.
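Assuming `model` is the Keras model sketched earlier, the loss is passed at the compile step like this:

```python
# Binary classification (labels are 0/1)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Multiclass classification (labels are one-hot encoded)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```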
Optimizer
The optimizer updates the model's weights (and, for adaptive methods, the effective learning rates) in order to minimize the loss function.
SGD (Stochastic Gradient Descent)
- Updates the weights frequently, after each mini-batch of training data
- Constant learning rate; updates have high variance
Adagrad
- Adapts the learning rate for each parameter based on past gradients (adaptive learning)
- Learning rates can shrink too quickly, which slows training
RMSProp (Root Mean Square Prop)
- Adaptive learning: uses a moving average of squared gradients to fix Adagrad’s diminishing learning rates
Adam (Adaptive Moment Estimation)
- The most widely used optimizer
- Combines the benefits of both Adagrad and RMSProp
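In Keras, each of these optimizers can be constructed and passed at compile time; the learning rates below are common defaults, not tuned values:

```python
from tensorflow.keras import optimizers

sgd = optimizers.SGD(learning_rate=0.01)
adagrad = optimizers.Adagrad(learning_rate=0.01)
rmsprop = optimizers.RMSprop(learning_rate=0.001)
adam = optimizers.Adam(learning_rate=0.001)

# Pick one and pass it at the compile step
model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])
```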
Regularization
Regularization techniques help prevent overfitting.
Dropout
- Randomly ignores a fraction of a layer’s outputs during training so that neurons do not all converge to the same goal
Kernel Regularizer (L1 and L2)
Kernel regularizers shrink the coefficients so that the weights do not grow too large, which helps prevent overfitting; see the sketch below for both dropout and kernel regularizers.
L1
- Encourages weights to be exactly 0 where possible, resulting in sparser weights
L2
- Penalizes large weights severely, resulting in small but mostly non-zero (less sparse) weights
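A minimal Keras sketch of both techniques (the dropout rate and the regularization factors are illustrative, not tuned):

```python
from tensorflow.keras import layers, regularizers

# Dropout: randomly zero out 50% of the previous layer's outputs during training
dropout = layers.Dropout(0.5)

# L1: pushes weights toward exactly 0 (sparser weights)
l1_dense = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.001))

# L2: penalizes large weights (small but mostly non-zero weights)
l2_dense = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.001))
```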
Fit a Deep Learning Model
For machine learning purposes, the whole dataset is usually divided into three parts: a training, a validation, and a test set. The reason for splitting into three parts is that we want to train the model on the training set and use the validation set for model tuning. After finding the best model, the true holdout (the test set) is used for the final evaluation. At the fit step, some people pass the test set as the validation data, others the reverse; whichever you use for validation, keep in mind that the training set should always account for the largest share.
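Here is a minimal sketch of that workflow, assuming `X` and `y` are NumPy arrays of images and labels and `model` has already been compiled; the 80/20 split ratios and training settings are illustrative:

```python
from sklearn.model_selection import train_test_split

# Hold out a test set first, then carve a validation set out of the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Tune the model against the validation set during training
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=20, batch_size=32)

# Final evaluation on the untouched test set
test_loss, test_acc = model.evaluate(X_test, y_test)
```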