Calculating Parameters of Convolutional and Fully Connected Layers with Keras

Explaining how to calculate the number of params and the output shape of convolutional and pooling layers

Yan Ding
6 min read · Oct 15, 2020

When we build a deep learning model, we often use convolutional layers followed by pooling layers and several fully-connected layers. It is useful to know how many parameters are in our model as well as the output shape of each layer. Let's first look at LeNet-5[1], a classic convolutional neural network architecture.

Fig1. LeNet-5

The input shape is (32,32,3). The first layer is a convolutional layer with a kernel size of (5,5) and 8 filters, followed by a max-pooling layer with kernel size (2,2) and stride 2. The second layer is another convolutional layer with a kernel size of (5,5) and 16 filters, followed by a max-pooling layer with kernel size (2,2) and stride 2. The third layer is a fully-connected layer with 120 units. The fourth layer is a fully-connected layer with 84 units. The output layer is a softmax layer with 10 outputs.

Now let’s build this model in Keras.

from tensorflow.keras import Sequential
from tensorflow.keras import layers

model = Sequential()
# Conv layer 1: 8 filters of size (5,5), input is a 32x32 RGB image
model.add(layers.Conv2D(8, (5, 5), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2), strides=2))
# Conv layer 2: 16 filters of size (5,5)
model.add(layers.Conv2D(16, (5, 5), activation='relu'))
model.add(layers.MaxPooling2D((2, 2), strides=2))
# Flatten the feature maps before the fully-connected layers
model.add(layers.Flatten())
model.add(layers.Dense(120, activation='relu'))
model.add(layers.Dense(84, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()

We can see the summary of the model as follows:

Fig2. Model summary

1 Calculating the output shape of Conv layers

Let's first look at the orange box, which shows the output shape of each layer. Before we dive in, here is the equation for calculating the output size of a convolutional layer: output size = (n + 2p - k)/s + 1, where n is the input size, k is the kernel size, p is the padding, and s is the stride (round down when the division is not exact).

The input shape is (32,32,3), the kernel size of the first Conv layer is (5,5), there is no padding, and the stride is 1, so the output size is (32-5)+1=28. The number of filters is 8, so the output shape of the first Conv layer is (28,28,8). It is followed by a max-pooling layer; the output size of a pooling layer is calculated the same way as for a Conv layer. The kernel size of the max-pooling layer is (2,2) and the stride is 2, so the output size is (28-2)/2+1=14. After pooling, the output shape is (14,14,8). You can try calculating the second Conv layer and pooling layer on your own. We skip to the output of the second max-pooling layer, which has the output shape (5,5,16). Before feeding it into the fully-connected layers, we need to flatten this output, which gives a vector of 5*5*16=400 units. Next, we need to know the number of params in each layer.
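To make the calculation above concrete, here is a minimal sketch of the output-size formula in Python (the helper name conv_output_size is my own; the printed values check the shapes derived above):

def conv_output_size(n, k, p=0, s=1):
    """Output size of a conv/pooling layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(32, 5))        # 28: first Conv layer
print(conv_output_size(28, 2, s=2))   # 14: first max-pooling layer
print(conv_output_size(14, 5))        # 10: second Conv layer
print(conv_output_size(10, 2, s=2))   # 5:  second max-pooling layer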

2 Calculating number of Params

2.1 Fully Connected Layer

Recall how to calculate the number of params of a simple fully-connected neural network:

Fig3. A simple fully connected neural network

For one training example, the input is [x1,x2,x3], which has 3 dimensions (e.g. for a house price prediction problem, the input is [square footage, number of bedrooms, number of bathrooms]). The first hidden layer has 4 units. Recall how the first hidden layer is calculated (suppose the activation function is the sigmoid function): a[1] = sigmoid(W[1]x + b[1]), where x is the (3,1) input vector.

So the dimension of W is (4, 3), the number of params in W is 4*3=12, and the dimension of b is (4, 1). The total number of params of the first hidden layer is 4*3+4=16. More generally, the dimensions of W and b for layer L are (n[L], n[L-1]) and (n[L], 1) respectively.

Here L denotes the L-th layer and n[L] is the number of units in layer L. So the number of params for layer L is n[L]*n[L-1] + n[L].
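As a quick sketch (the helper name dense_params is my own), the same formula reproduces the fully-connected counts that appear in the model summary above:

def dense_params(n_prev, n):
    """Params of a fully-connected layer: n_prev * n weights plus one bias per unit."""
    return n_prev * n + n

print(dense_params(3, 4))      # 16: the toy example in Fig3
print(dense_params(400, 120))  # 48120: third layer of our model
print(dense_params(120, 84))   # 10164: fourth layer
print(dense_params(84, 10))    # 850: output layer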

2.2 Convolutional layer

The calculation of params for convolutional layers is different, especially when convolving over a volume. Suppose we have an image of size (32,32,3) and a kernel size of (3,3); the shape of the params is then (3,3,3), which is a cube as follows:

Fig4. Convolutions over volume (from week 1, "Convolutions Over Volume", of the "Convolutional Neural Networks" course in the Deep Learning Specialization on Coursera[2])

The yellow cube contains all the params for one filter. And don't forget the bias b: each filter has one bias. So the number of params for one filter is 3*3*3+1=28. If there are 2 filters in the first layer, the total number of params is 28*2=56. More generally, the number of params of a Conv layer is (k*k*n[L-1] + 1)*n[L].

Here k is the kernel size, n[L] is the number of filters in layer L, and n[L-1] is the number of filters in layer L-1, which is also the number of channels of the cube.
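A minimal sketch of this formula in Python (the helper name conv_params is my own):

def conv_params(k, n_prev, n):
    """Params of a Conv layer: each of the n filters has k*k*n_prev weights plus one bias."""
    return (k * k * n_prev + 1) * n

print(conv_params(3, 3, 2))  # 56: the example above with 2 filters of size (3,3) on 3 channels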

Now let's see our example. The blue box in Fig2 shows the number of params of each layer. The input shape is (32,32,3). The kernel size of the first Conv layer is (5,5) and the number of filters is 8. The number of params of one filter is 5*5*3+1=76. There are 8 such cubes, so the total number is 76*8=608.

The pooling layer has no params. The second Conv layer has a (5,5) kernel size and 16 filters. Remember that the cube now has 8 channels, which is the number of filters of the previous layer, so the number of params is (5*5*8+1)*16=3216. Flattening the output of the second max-pooling layer gives a vector with 400 units; Flatten also has no params. The third layer is a fully-connected layer with 120 units, so the number of params is 400*120+120=48120. The fourth layer is calculated in the same way: 120*84+84=10164. The number of params of the output layer is 84*10+10=850. Now we have all the numbers of params of this model.
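Putting it all together, here is a short sanity check (assuming the `model` built above) that the hand-derived counts add up to what Keras reports:

conv1 = (5 * 5 * 3 + 1) * 8      # 608
conv2 = (5 * 5 * 8 + 1) * 16     # 3216
fc1 = 400 * 120 + 120            # 48120
fc2 = 120 * 84 + 84              # 10164
out = 84 * 10 + 10               # 850

total = conv1 + conv2 + fc1 + fc2 + out
print(total)                 # 62958
print(model.count_params())  # should also print 62958, matching Fig2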

3 Summary

Having a good knowledge of the output dimensions and params of each layer helps you better understand the construction of the model. Furthermore, it also tells you how many weights each training iteration updates. When looking at popular models such as EfficientNet[3], ResNet-50, Xception, Inception, BERT[4], and LayoutLM[5], it is necessary to look at the model size rather than only the accuracy, because the model size affects the inference speed as well as the computing resources it consumes.


Reference

[1] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, November 1998.

[2] Andrew Ng, week 1 of “Convolutional Neural Networks” Course in “Deep Learning Specialization”, Coursera.

[3] Mingxing Tan, Quoc V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," May 2019.

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," May 2019.

[5] Yiheng Xu, Minghao Li, et al., "LayoutLM: Pre-training of Text and Layout for Document Image Understanding," Dec 2019.
