Let’s Explore Convolutional Neural Networks

Mar 21, 2022


Artificial intelligence has grown dramatically in its ability to bridge the gap between human and machine capabilities. Researchers and enthusiasts alike work on various aspects of the field to achieve amazing results. The domain of computer vision is one of many such areas.

The goal of this field is to enable machines to perceive the world much as humans do, and to use that perception for a variety of tasks such as image and video recognition, image analysis and classification, media recreation, recommendation systems, and natural language processing. Deep learning advances in computer vision have been built and refined over time, primarily around one specific family of models: the convolutional neural network.

Let us discuss it in detail in the form of questions and answers.

What is a convolutional neural network? And why do CNNs work better with image data?

A convolutional neural network (CNN) is a type of artificial neural network designed to process pixel data, and it is widely used for image recognition and processing. The core idea is that, unlike a fully connected layer, which assigns a separate weight to every pixel of the picture, a convolutional layer uses a kernel that is smaller than the input picture and slides across it. As a result, the same set of weights is applied to different parts of the picture (so-called weight or parameter sharing), which lets the network detect the same pattern wherever it appears in the image.

The practical benefit is that having fewer parameters shortens training time and reduces the amount of data required to train the model.

In simple words, the two main reasons CNNs work better with image data are:

1) Parameter Sharing:

A feature detector (such as a vertical edge detector or horizontal edge detector) that’s useful in one part of the image is probably useful in another part of the image.

2) Sparsity of connections:

In each layer, each output value depends only on a small number of inputs, instead of taking into account all the inputs.
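
To make the effect of parameter sharing concrete, here is a small sketch that counts the learnable parameters of a fully connected layer and a convolutional layer on the same input. The image size (32x32 RGB), the 100 output units, and the six 5x5 filters are illustrative assumptions, not values from this article.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights plus biases of a fully connected layer on a flattened image."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(kernel_h, kernel_w, in_c, num_filters):
    """Weights plus biases of a conv layer; note it does not depend on image size."""
    return kernel_h * kernel_w * in_c * num_filters + num_filters


# Hypothetical example: a 32x32 RGB image.
fc_count = dense_params(32, 32, 3, out_units=100)    # 32*32*3*100 + 100
conv_count = conv_params(5, 5, 3, num_filters=6)     # 5*5*3*6 + 6
print(fc_count, conv_count)  # 307300 456
```

The convolutional layer needs a few hundred parameters where the fully connected layer needs hundreds of thousands, because the same small filters are reused at every position.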

What is the role of the convolution operator on CNN?

The purpose of the convolution operation is to extract features from the input image: low-level features such as edges in the early layers, and progressively higher-level features in deeper layers. A convolution essentially slides a filter over the input. At each position, it multiplies the filter values element-wise with the image patch underneath and sums the results. The operation takes two inputs: an image matrix and a set of filters whose parameters are learned during training.
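
The sliding-filter idea can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding and a stride of 1; the 6x6 image and the hand-written vertical edge filter are made up for the example (in a real CNN the filter values are learned). As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# 6x6 image: bright left half (10), dark right half (0).
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# Hand-written vertical edge detector (illustrative, not learned).
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [0, 30, 30, 0]
```

The large values in the middle columns of the 4x4 output mark the position of the vertical edge between the bright and dark halves.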

What are the different layers of CNN? What is the pooling layer? Why do we use it?

There are three types of layers in a convolutional neural network: the convolutional layer, the pooling layer, and the fully connected layer. Each performs a different task on its input. Convolutional and fully connected layers have weights that are learned during training, while pooling layers have only hyperparameters.

The pooling layer is one of the building blocks of a CNN and is typically placed after a convolutional layer. Its main function is to progressively reduce the spatial size of the representation, which reduces the amount of computation and the number of parameters in later layers. The pooling layer operates on each feature map independently. The most common approach is max pooling, which keeps the largest value in each window.

CNNs use pooling layers to shrink the feature maps, which speeds up computation in the rest of the network.

One interesting property of the pooling layer is that it has a set of hyperparameters but it has no parameters to learn.
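
Here is a minimal sketch of max pooling on a single feature map. The window size and stride are the hyperparameters; there is nothing to learn. The 4x4 input values are made up for illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size)
                      for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


fmap = [[1, 3, 2, 1],
        [2, 9, 1, 1],
        [1, 3, 2, 3],
        [5, 6, 1, 2]]

pooled = max_pool2d(fmap)  # 2x2 windows, stride 2
print(pooled)  # [[9, 2], [6, 3]]
```

A 4x4 feature map becomes 2x2, so the layers that follow have a quarter as many values to process.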

Why do we use the padding technique on CNN?

After a convolution, the output is smaller than the input: an n x n image convolved with an f x f filter produces an (n - f + 1) x (n - f + 1) output. To keep the output the same size as the input, we use padding. Padding adds zeros symmetrically around the border of the input matrix, which gives the filter more positions to cover, including positions centered on border pixels. As a result, information near the edges of the image contributes to more output values instead of being largely discarded.

In short, the convolution operation has two main downsides: the image shrinks every time a convolution is applied, and a lot of information near the edges of the image is thrown away. To overcome these problems, we use padding in CNNs.
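
The shrinking effect and the fix can be checked with a short sketch. It uses the standard output-size formula floor((n + 2p - f) / s) + 1, where n is the input size, f the filter size, p the padding, and s the stride; the function names are hypothetical helpers written for this example.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output width or height of a convolution along one dimension."""
    return (n + 2 * padding - f) // stride + 1


def zero_pad(image, p):
    """Add p rows and columns of zeros on every side of a 2D image."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([0] * width for _ in range(p))
    return padded


print(conv_output_size(6, 3))             # 4: a 6x6 image shrinks to 4x4
print(conv_output_size(6, 3, padding=1))  # 6: with padding 1 the size is kept

small = [[1, 2],
         [3, 4]]
padded_small = zero_pad(small, 1)
for row in padded_small:
    print(row)
```

For an odd filter size f and a stride of 1, choosing p = (f - 1) / 2 keeps the output the same size as the input.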

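What are Residual Networks? And why do we use them?

Residual networks (ResNets) address the degradation problem, in which adding more layers to a plain deep network makes it harder to train, so that the training error goes up instead of down. They do this with shortcuts, or skip connections, that carry the output of a shallower layer forward and add it to the output of a deeper layer. Skip connections also give gradients a more direct path backward, which eases the vanishing/exploding gradient problem. As a result, residual blocks can be stacked with far less degradation in performance, which enables very deep networks to be built.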

We know that making a plain network deeper can hurt its ability to do well even on the training set. Residual blocks allow much deeper neural networks to be trained. They rely on a technique called skip connections: a skip connection bypasses a few layers and adds the input of the block directly to its output, so those layers only need to learn a residual F(x), and the block computes F(x) + x.

The main advantages of ResNets are:

1) Networks with a large number of layers (hundreds, or even a thousand or more) can be trained without the training error increasing as depth grows.

2) ResNets help tackle the vanishing gradient problem through identity mappings.
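
The following sketch shows the skip connection on a toy residual block. It assumes small fully connected layers on plain Python lists rather than real convolutional layers, and all function names are hypothetical. With all weights set to zero, the block simply passes its input through, which is the identity mapping mentioned above.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(values, weights, bias):
    """One fully connected layer: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """Compute relu(F(x) + x), where F is linear -> relu -> linear."""
    f = linear(relu(linear(x, w1, b1)), w2, b2)
    return relu([fi + xi for fi, xi in zip(f, x)])  # skip connection adds x


x = [1.0, 2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With zero weights F(x) = 0, so the block outputs its (non-negative) input.
out = residual_block(x, zero_w, zero_b, zero_w, zero_b)
print(out)  # [1.0, 2.0, 3.0]
```

Because doing nothing is this easy for a residual block, adding more blocks should not make the network worse than its shallower version.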

What is the practical advice for using convolutional neural networks? How can transfer learning help us?

Data Augmentation: Most computer vision tasks could use more data; in practice, having more data helps in almost all of them. We can use data augmentation techniques to generate more training data from our existing data.

Common augmentation methods include mirroring, random cropping, rotating, and shearing.
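
Two of these methods, mirroring and random cropping, can be sketched on a tiny image stored as a list of rows. This is only an illustration with hypothetical helper names; real pipelines usually rely on an image library.

```python
import random


def mirror(image):
    """Flip the image horizontally (left to right)."""
    return [list(reversed(row)) for row in image]


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut out a crop_h x crop_w patch at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

flipped = mirror(image)
print(flipped)  # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]

# A seeded generator makes the random crop repeatable.
crop = random_crop(image, 2, 2, rng=random.Random(0))
print(crop)
```

Each transformed copy can be added to the training set with the same label as the original image.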

Transfer Learning: Transfer learning is the reuse of a pre-trained model on a new problem: the model exploits the knowledge gained from a previous task to improve generalization on another. The basic premise is simple: take a model trained on a large dataset and transfer its knowledge to a task with a smaller dataset. For object recognition with a CNN, we typically freeze the early convolutional layers of the network and train only the last few layers, which make the prediction.

How is the object detection task different from the image classification task?

The difference between object detection and classification is that in detection, we also draw a bounding box around each object of interest to locate it within the image. Moreover, a detection task does not necessarily involve just one bounding box: there may be many bounding boxes representing different objects of interest within the image, and you do not know how many beforehand.

In short, image classification involves predicting the class of one object in an image. Object localization refers to identifying the location of one or more objects in an image and drawing a bounding box around their extent. Object detection combines these two tasks and localizes and classifies one or more objects in an image.

What are the algorithms that we can use for object detection?

Top 8 Algorithms for Object Detection

• Fast R-CNN

• Faster R-CNN

• Histogram of Oriented Gradients (HOG)

• Region-based Convolutional Neural Networks (R-CNN)

• Region-based Fully Convolutional Network (R-FCN)

• Single Shot Detector (SSD)

• Spatial Pyramid Pooling (SPP-net)

• YOLO (You Only Look Once)

What is the YOLO algorithm? How does YOLO work?

YOLO (You Only Look Once) is an algorithm that uses neural networks to provide real-time object detection. This algorithm is popular because of its speed and accuracy. It has been used in various applications to detect traffic signals, people, parking meters, and animals.

How YOLO works: The YOLO algorithm divides the input image into an S x S grid. Each grid cell is responsible for detecting objects whose center falls inside it. Each grid cell then predicts a fixed number of bounding boxes.

Every bounding box has five elements (x, y, w, h, confidence score). x and y are the coordinates of the center of the box, and w and h are its width and height. The confidence score reflects both how likely it is that the box contains an object and how accurately the box fits that object. YOLO treats detection as a regression problem, so detection, localization, and classification of the objects in the input image all take place in a single pass through the network.
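
As a rough sketch of the grid idea, the code below finds which grid cell is responsible for an object and computes the size of the prediction tensor. The values S = 7, B = 2 boxes per cell, C = 20 classes, and the 448x448 image follow the original YOLO (v1) setup and are assumptions here, as are the helper names; later YOLO versions differ.

```python
def responsible_cell(x_center, y_center, img_w, img_h, S):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(x_center / img_w * S), S - 1)
    row = min(int(y_center / img_h * S), S - 1)
    return row, col


def output_size(S, B, C):
    """Each cell predicts B boxes of 5 values each, plus C class scores."""
    return S * S * (B * 5 + C)


# Hypothetical object centered at (224, 100) in a 448x448 image.
cell = responsible_cell(224, 100, img_w=448, img_h=448, S=7)
print(cell)                    # (1, 3)
print(output_size(7, 2, 20))   # 1470
```

The whole prediction for an image is one fixed-size tensor, which is why the network can produce all detections in a single forward pass.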

Why do we use the non-max suppression technique?

One of the problems of object detection is that your algorithm may find multiple detections of the same object. Non-max suppression is a way to make sure that your algorithm detects each object only once.

Ideally, for each object in the image, we must have a single bounding box. To select the best bounding box, from the multiple predicted bounding boxes, these object detection algorithms use non-max suppression. This technique is used to “suppress” the less likely bounding boxes and keep only the best ones.

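In short, YOLO uses non-max suppression (NMS) to keep only the best bounding box for each object. The first step is to remove all predicted bounding boxes whose detection probability is below a given threshold. Then the box with the highest probability is kept, any remaining box that overlaps it heavily (measured by intersection over union, IoU) is discarded, and the process repeats until no boxes remain.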
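
Here is a minimal sketch of greedy non-max suppression for a single class. Boxes are assumed to be given as (x1, y1, x2, y2) corners, and both thresholds (0.5) and the sample boxes are illustrative choices, not values from this article.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    # Step 1: drop low-confidence boxes, then sort the rest by score.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    keep = []
    while candidates:
        # Step 2: keep the best remaining box.
        best = candidates.pop(0)
        keep.append(best)
        # Step 3: discard boxes that overlap it too much.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]

kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box is dropped because it overlaps the first one heavily, and the last box is dropped for its low score, leaving one box per object.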

What is the difference between CNN and R-CNN?

A convolutional neural network (CNN) is mainly used for image classification, while an R-CNN, with the R standing for region, is used for object detection.

A typical CNN can only tell you the class of the objects but not where they are located. It is actually possible to regress bounding boxes directly from a CNN but that can only happen for one object at a time. If multiple objects are in the visual field, then the CNN bounding box regression cannot work well due to interference.

In R-CNN, the CNN is made to focus on a single region at a time. This minimizes interference, because only a single object of interest is expected to dominate a given region. The regions are proposed by a selective search algorithm and then resized to a common size before they are fed to a CNN for classification and bounding box regression.

CNNs are just one of the many families of algorithms used in image processing and detection. Going forward, we will discuss many more algorithms for image processing. Until then, we wish you good health and success. Do visit the official website of AlgoZenith if you wish to master Data Structures and Algorithms for your future internship and job tests.

Do let us know in the comments if you liked the content, and check out our blog series on Finance and Product Management. Do check this blog if you are searching for an ultimate guide for your job or internship. Stay tuned for more such blogs. You can also check out the previous blogs of this series, Everything about Data Science and How to approach a Data Science project for a beginner. Keep learning, keep shining.