Build a Convolutional Neural Network for Beginners

Jonna Wang
4 min read · May 23, 2021


In this article, I will discuss some basic concepts of a convolutional neural network and how to properly tune a model.

[Image by José Alberto Benítez-Andrades, ResearchGate]

What is a Convolutional Neural Network and how does it work?

A Convolutional Neural Network, also called a CNN, is a type of deep learning model that is frequently applied to image classification. A CNN is constructed from three types of repeating building blocks: convolutional layers, pooling layers, and fully connected layers.

The convolutional layer, which performs feature extraction, is the fundamental component of the CNN architecture. This layer applies a specialized linear algebra operation by sliding a filter (also called a kernel) across the image. The filter size is usually set to an odd number, such as 3×3 or 5×5. The resulting matrix is called a feature map. This procedure is repeated with multiple filters to form an arbitrary number of feature maps.
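To make this concrete, here is a minimal NumPy sketch of the operation: a single 3×3 filter slides across a toy image to produce a feature map (the image and kernel values here are made up for illustration):

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image (no padding, stride 1) to build a feature map.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return feature_map

image = np.random.rand(6, 6)            # toy 6x6 grayscale image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])         # 3x3 vertical-edge filter
print(convolve2d(image, kernel).shape)  # (4, 4) feature map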

[Image by Andrew Ng]

After a convolutional layer, those feature maps are fed into a pooling layer. The pooling layer reduces dimensionality by performing a downsampling operation. There are two types of pooling: max pooling and average pooling. Max pooling (the most popular) extracts the maximum value from each patch (determined by the pooling size) and discards the rest. Average pooling takes the average value of each patch. The resulting feature maps, with reduced dimensionality, are passed to the next convolutional layer for further pattern finding.
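As a quick illustration, here is how the two pooling types behave in Keras on a random 4×4 feature map (the shapes are the point here, not the values):

import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 4, 1).astype("float32")  # one 4x4 single-channel feature map

max_pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2))

print(max_pool(x).shape)  # (1, 2, 2, 1): the max of each 2x2 patch
print(avg_pool(x).shape)  # (1, 2, 2, 1): the mean of each 2x2 patch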

[Image by Muhamad Yani, ResearchGate]

The final reduced-dimension feature maps are passed through a flatten layer. The flatten layer transforms the 2D matrices into a 1D vector, which then passes through one (or more) fully connected layers (also known as dense layers) to make predictions for the images.

Here is an example of a CNN with three of these building blocks.
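Below is a minimal Keras sketch of such a model, assuming 28×28 grayscale inputs and 10 classes (MNIST-like data); the filter counts and layer sizes are illustrative, not prescriptive:

import tensorflow as tf
from tensorflow.keras import layers, models

# Three conv/pool blocks, then flatten + dense layers for classification.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                        # 2D feature maps -> 1D vector
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # 10 classes assumed
])
model.summary()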

Tuning CNN arguments

Activation Function

The activation function (also known as the transfer function) determines how a neuron (also called a node) transforms its input into an output, letting the network emphasize the more important information.

[Image from Knowledge Transfer]

ReLU (Rectified Linear Unit)

Pros

  • The most commonly used activation function in CNNs
  • Outputs positive inputs unchanged and maps negative inputs to 0
  • Computationally efficient
  • Mitigates the vanishing gradient problem (for positive inputs)

Cons

  • “Dying ReLU / zero dying”: if too many outputs become 0, those neurons stop updating and the model stops learning

Leaky ReLU

Pros

  • Similar to ReLU, but does not suffer from “zero dying”

Cons

  • The negative-slope coefficient α needs to be defined prior to training (see the sketch below)
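In Keras, ReLU is usually passed as the activation argument, while Leaky ReLU is typically added as its own layer after a linear convolution; the α = 0.1 below is an arbitrary illustrative choice:

import tensorflow as tf
from tensorflow.keras import layers

relu_conv = layers.Conv2D(32, (3, 3), activation="relu")  # standard ReLU

leaky_block = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3)),    # no activation here
    layers.LeakyReLU(alpha=0.1),  # negative inputs get slope 0.1 instead of 0
])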

The activation function for the last fully connected layer is determined by the type of classification (sketched after this list).

  • For a binary classification problem, “sigmoid” will be used.
  • For a multiclass classification problem, “softmax” will be used.
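A minimal sketch of the two output layers in Keras (the 10-class case is an assumed example):

from tensorflow.keras import layers

binary_output = layers.Dense(1, activation="sigmoid")       # binary: one output neuron
multiclass_output = layers.Dense(10, activation="softmax")  # multiclass: one neuron per class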

Loss Function

The loss function (also known as the cost function) evaluates the performance of the model by measuring the difference between the true labels and the predicted labels.

The loss function at the compile step is determined by the type of classification (see the compile sketch after this list).

  • For a binary classification problem, “binary_crossentropy” will be used.
  • For a multiclass classification problem, “categorical_crossentropy” will be used.
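For example, with a Keras model built as above, the compile step might look like this (the metrics choice is illustrative):

# Binary classification
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Multiclass classification (one-hot encoded labels)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])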

Optimizer

The optimizer updates the model’s weights (and, for adaptive methods, its effective learning rates) to minimize the loss function. The most common choices are listed below, followed by a short sketch.

SGD (Stochastic Gradient Descent)

  • Updates the weights after every mini-batch (or single sample) of data
  • Uses a constant learning rate; its updates have high variance

Adagrad

  • Adapts the learning rate for each parameter based on previous gradients (adaptive learning)
  • The learning rate can shrink quickly, which slows training

RMSProp (Root Mean Square Prop)

  • (Adaptive learning) Uses a moving average of squared gradients to counter Adagrad’s diminishing learning rates

Adam (Adaptive Moment Estimation)

  • The most widely used optimizer
  • Combines the benefits of Adagrad and RMSProp
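All four optimizers are available in Keras and can be passed to model.compile(); the learning rates below are common defaults, not tuned values:

from tensorflow.keras import optimizers

sgd = optimizers.SGD(learning_rate=0.01)
adagrad = optimizers.Adagrad(learning_rate=0.01)
rmsprop = optimizers.RMSprop(learning_rate=0.001)
adam = optimizers.Adam(learning_rate=0.001)

model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])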

Regularization

Regularization techniques prevent the model from overfitting.

Dropout

  • Randomly ignores some of a layer’s outputs during training to prevent all the neurons from converging to the same goal (see the sketch below)
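A minimal Keras sketch; the 0.5 dropout rate is a common starting point, not a rule:

import tensorflow as tf
from tensorflow.keras import layers

block = tf.keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),  # randomly zeroes 50% of the outputs during training only
])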

Kernel Regularizer (L1 and L2)

Kernel regularizers shrink the coefficients so that the weights won’t grow too large, which helps prevent the model from overfitting. A Keras sketch follows the two cases below.

L1

  • Encourages weights to be exactly 0 where possible, resulting in sparser weights

L2

  • Penalizes large weights severely, resulting in small but nonzero (less sparse) weights
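In Keras, both penalties are attached through the kernel_regularizer argument; the 0.01 factors below are illustrative:

from tensorflow.keras import layers, regularizers

dense_l1 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))  # pushes weights to exactly 0
dense_l2 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.01))  # keeps weights small but nonzero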

Fit a Deep Learning Model

For machine learning purposes, the whole dataset is usually divided into three parts: a training set, a validation set, and a test set. The reason for splitting into three parts is that we train the model on the training set and use the validation set for model tuning. After finding the best model, the true holdout (the test set) is used for the final evaluation. As the sketch below shows, some people pass the test set as the validation data at the fit step, and others do the reverse. Whichever set you use for validation, keep in mind that the training set should always account for the larger part.
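Here is one way the split and fit step might look, assuming X and y already hold the images and labels (the split fractions are illustrative):

from sklearn.model_selection import train_test_split

# Carve out the holdout test set first, then split the rest into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2)

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),  # used for tuning
                    epochs=10, batch_size=32)

model.evaluate(X_test, y_test)  # the true holdout, used only once at the end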
