An intro to Convolutional Neural Networks (CNN)
In this article, I will talk about some notions in CNNs; the topics discussed here were presented by the Udacity Bertelsmann Scholarship.
What is a CNN?
Computer vision is evolving rapidly day by day, and one of the reasons is deep learning. When we talk about computer vision, the term convolutional neural network (abbreviated as CNN) comes to mind, because CNNs are heavily used here. Examples of CNNs in computer vision are face recognition, image classification, etc. A CNN is similar to a basic neural network; it also has learnable parameters such as weights and biases.
Applications of CNNs
CNNs achieve state-of-the-art results in a variety of problem areas, including Voice User Interfaces, Natural Language Processing, and computer vision.
1. WaveNet by Google:
In the field of Voice User Interfaces, Google made use of CNNs in its recently released WaveNet model [1]. This model takes any piece of text as input and does an excellent job of returning computer-generated audio of a human reading the text. What’s really cool is that if you supply the algorithm with enough samples of your voice, it’s possible to train it to sound just like you. CNNs are also used in the field of Natural Language Processing, as the next example shows.
2. Sentiment classification:
Sentiment analysis examines the problem of studying texts, like posts and reviews uploaded by users on microblogging platforms, forums, and electronic businesses, regarding the opinions they have about a product, service, event, person, or idea [2]. For example, is the writer happy or sad? If they’re talking about a movie, did they like or dislike it?
3. Play video games such as Atari Breakout:
CNN-based models are able to learn to play games without being given any prior knowledge of what a ball is, and without even being told precisely what the controls do. The agent only sees the screen and its score, but it does have access to all of the controls that you’d give a human user. With this limited knowledge, CNNs can extract crucial information that allows them to develop a useful strategy. CNNs have even been trained to play Pictionary.
4. Go:
Go is an ancient Chinese board game considered one of the most complex games in existence. It is said that there are more configurations in the game than there are atoms in the universe. Recently, researchers from Google’s DeepMind used CNNs to train an artificially intelligent agent to beat human professional Go players.
5. Drones:
CNNs also allow drones to navigate unfamiliar territory. Drones are now used to deliver medical supplies to remote areas, and CNNs give them the ability to see, that is, to determine what’s happening in streaming video data.
Let’s now look at convolutional neural networks and how they improve our ability to classify images. In general, CNNs can look at images as a whole and learn to identify spatial patterns such as prominent colors and shapes, or whether a texture is fuzzy or smooth, and so on. The shapes and colors that define any image, and any object in an image, are often called features.
What is a feature?
A helpful way to think about what a feature is: think about what we are visually drawn to when we first see an object and when we identify different objects. For example, what do we look at to distinguish a cat and a dog? The shape of the eyes, the size, and how they move are just a couple of examples of visual features.
As another example, say we see a person walking toward us and we want to see if it’s someone we know; we may look at their face, and even further, their general shape and eyes (and even the color of their eyes). The distinct shape of a person and their eye color are great examples of distinguishing features!
How Computers Interpret Images
Any grayscale image is interpreted by a computer as an array: a grid of values. Each grid cell is called a pixel, and each pixel has a numerical value. Each image in the MNIST database is 28 pixels high and 28 pixels wide, and so it’s understood by a computer as a 28-by-28 array.
In a typical gray scale image, white pixels are encoded as the value 255, and black pixels are encoded as zero.
Gray pixels fall somewhere in between, with light gray being closer to 255. These MNIST images have actually gone through a quick pre-processing step: they’ve been re-scaled so that each image has pixel values in a range from zero to one, as opposed to from 0 to 255. To go from a range of 0–255 to a range of 0–1, you just have to divide every pixel value by 255. This step is called normalization, and it’s common practice in many deep learning techniques.
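As a quick illustration, here is a minimal NumPy sketch of that rescaling step; the random array below simply stands in for a real 28-by-28 MNIST image:

```python
import numpy as np

# A made-up 28x28 "image" with pixel values in 0-255; in practice this
# would be a real image from the MNIST database.
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Normalization: divide every pixel value by 255 so that values fall
# in the range [0, 1].
normalized = image.astype(np.float32) / 255.0

print(normalized.shape)                    # (28, 28)
print(normalized.min(), normalized.max())  # values now lie within [0.0, 1.0]
```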
Normalization will help our algorithm to train better. The reason we typically want normalized pixel values is because neural networks rely on gradient calculations. These networks are trying to learn how important or how weighty a certain pixel should be in determining the class of an image. Normalizing the pixel values helps these gradient calculations stay consistent, and not get so large that they slow down or prevent a network from training.
So, now that we have normalized data, how might we approach the task of classifying these images? Well, you already learned one method for classification: using a multi-layer perceptron (MLP). How might we input this image data into an MLP? Recall that MLPs only take vectors as input. So, in order to use an MLP with images, we have to first convert any image array into a vector. This process is so common that it has a name: flattening.
I’ll illustrate this flattening conversion process on a small example here. In the case of a four-by-four image, we have a matrix with 16 pixel values. Instead of representing this as a four-by-four matrix, we can construct a vector with 16 entries, where the first four entries of our vector correspond to the first row of our old array, the second four entries correspond to the second row, and so on. After converting our images into vectors, they can then be fed into the input layer of an MLP.
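Below is a small sketch of this flattening step, assuming NumPy is available; the 4-by-4 array of consecutive integers is just a stand-in for real pixel values:

```python
import numpy as np

# A small 4x4 "image" with 16 pixel values (0 through 15 for clarity).
image = np.arange(16).reshape(4, 4)

# Flattening: the first four entries of the vector are the first row of
# the array, the next four are the second row, and so on.
vector = image.flatten()

print(vector)        # [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
print(vector.shape)  # (16,)
```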
Normalizing image inputs
Data normalization is an important pre-processing step. It ensures that each input (each pixel value, in this case) comes from a standard distribution. That is, the range of pixel values in one input image is the same as the range in another image. This standardization makes our model train and reach a minimum error faster!
Data normalization is typically done by subtracting the mean (the average of all pixel values) from each pixel, and then dividing the result by the standard deviation of all the pixel values. Sometimes you’ll see an approximation here, where we use a mean and standard deviation of 0.5 to center the pixel values.
The distribution of such data should resemble a Gaussian function centered at zero. For image inputs, we need the pixel numbers to be positive, so we often choose to scale the data to a normalized range of [0, 1] first.
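Here is a rough sketch of the mean-and-standard-deviation-of-0.5 approximation mentioned above; the input array is made up for illustration and assumes the pixels have already been scaled to [0, 1]:

```python
import numpy as np

# Made-up pixel values already scaled to the range [0, 1].
pixels = np.random.rand(28, 28).astype(np.float32)

# Approximate standardization: subtract a mean of 0.5 and divide by a
# standard deviation of 0.5, centering the values roughly around zero
# (they now fall in the range [-1, 1]).
standardized = (pixels - 0.5) / 0.5

print(standardized.min(), standardized.max())
```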
MLP Structure & Class Scores
After looking at and normalizing our image data, we’ll then create a neural network for discovering the patterns in our training data. After training, our network should be able to look at totally new images that it hasn’t trained on and classify the digits contained in those images. This previously unseen data is often called test data.
At this point, our images have been converted into vectors with 784 entries. So, the first input layer in our MLP should have 784 nodes. We also know that we want the output layer to distinguish between 10 different digit types, zero through nine. So, we’ll want the last layer to have 10 nodes. So, our model will take in a flattened image and produce 10 output values, one for each possible class, zero through nine. These output values are often called class scores.
A high class score indicates that a network is very certain that a given input image falls into a certain class. You can imagine that the class scores for an image of a 3, for example, will have a high score for the class three and a low score for the classes zero, one, and so on. But it may also have a small score for an eight or any other class that looks kind of similar in shape to a three. The class scores are often represented as a vector of values or even as a bar graph indicating the relative strengths of the scores.
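The scores below are made-up numbers, just to illustrate the idea; applying a softmax is one common way to turn raw class scores into relative strengths that sum to one:

```python
import numpy as np

# Made-up raw class scores for the digits 0-9, as a network might output
# for an image of a 3: high score for class 3, a smaller score for the
# similarly shaped 8, low scores everywhere else.
class_scores = np.array([0.1, 0.2, 0.3, 6.0, 0.1, 0.2, 0.1, 0.3, 2.5, 0.2])

# A softmax turns the raw scores into relative strengths that sum to one.
relative_strengths = np.exp(class_scores) / np.exp(class_scores).sum()

print(relative_strengths.round(3))
print("predicted digit:", relative_strengths.argmax())  # 3
```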
Now, the part of this MLP architecture that’s up to you to define is really in between the input and output layers. How many hidden layers do you want to include and how many nodes should be in each one?
This is a question you’ll come across a lot as you define neural networks to approach a variety of tasks. We usually start by looking at any papers or related work we can find that may act as a good guide. In this case, we would search for MLP for MNIST, or even more generally, MLP for classifying small greyscale images.
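As one possible starting point, here is a minimal sketch of such an MLP written with PyTorch; the framework and the hidden-layer sizes are my own assumptions for illustration, not a prescription from the course:

```python
import torch.nn as nn

# A minimal MLP for MNIST: a flattened 28x28 image (784 values) goes in,
# 10 class scores (one per digit) come out. The two hidden layers and
# their sizes (512 and 256) are example choices, not a fixed rule.
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

print(model)
```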
References:
[1]: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
[2]: https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html
[3]: Udacity Course