"There's no such thing as fully connected layer" (Yann LeCun - In Convolutional Nets, there is no such thing...)

In short, the decision-making layers at the end of a conv. net may just as well be kernels of a conv. layer. If you use the weights of these layers as the weights of a kernel and convolve it with the feature maps produced by the preceding convolutions, you are effectively performing the classification on a local patch. This yields a coarse localization of which classes were recognized where.
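To make that concrete, here's a minimal PyTorch sketch (all sizes are made up for illustration): take the weight matrix of a fully-connected classifier, reshape it into conv kernels, and the resulting conv layer computes the same class scores on the original feature map size, while also accepting larger maps and returning a coarse grid of scores.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: final feature maps of 512 x 7 x 7 and K = 10 classes.
K, C, H, W = 10, 512, 7, 7

fc = nn.Linear(C * H * W, K)

# Reinterpret the FC weights as K conv kernels of size C x 7 x 7.
conv = nn.Conv2d(C, K, kernel_size=(H, W))
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(K, C, H, W))
    conv.bias.copy_(fc.bias)

# On 7x7 feature maps both layers produce identical class scores ...
feats = torch.randn(1, C, H, W)
assert torch.allclose(fc(feats.flatten(1)), conv(feats).flatten(1), atol=1e-5)

# ... but the conv version also accepts larger feature maps, yielding a
# coarse spatial grid of class scores: classification of local patches.
bigger = torch.randn(1, C, 15, 15)
print(conv(bigger).shape)  # torch.Size([1, 10, 9, 9])
```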

Some basics first


Conv. nets for image classification commonly take an image (a 3-dim matrix of size Width x Height x Channels) as input and produce a vector output of length K. The vector represents the probability of the presence of each of the K object types that the network has learned about.
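For example (a toy sketch; the logits are made-up numbers), the raw class scores are typically passed through a softmax so they form a proper probability vector over the K classes:

```python
import torch

# Toy example: raw scores (logits) for K = 5 classes on one image.
logits = torch.tensor([[2.0, -1.0, 0.5, 0.0, 1.0]])
probs = torch.softmax(logits, dim=1)  # probabilities over the K classes
print(probs)        # tensor([[0.56, 0.03, 0.13, 0.08, 0.21]]) (rounded)
print(probs.sum())  # tensor(1.) -- a proper probability vector
```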
The first conv. layer transforms the image of size W x H with C channels into F_1 feature maps. The 2nd conv. layer takes the F_1 feature maps and transforms them into F_2 feature maps, and so on. We often have a non-linearity and pooling between convolutions. The idea is to use the conv. layers for learning hierarchical features from the images. Pooling aggregates the information we get from convolution, which is likely to be more informative without needing as high a resolution. That's why we often see funnel-shaped diagrams of conv. nets: the network converges from high-resolution information to a reduced but highly informative space.

At some point we're done with extracting features from the image and want to get on to classification. You can append a classifier such as a multi-layer perceptron (MLP), logistic regression, or another classifier that takes the latest feature maps and performs the classification. The MLP is the type of classifier most commonly chosen, because you can train its weights together with the weights of the conv. layers via backpropagation. MLPs fit nicely into conv. nets.
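Putting those pieces together, a funnel-shaped net might look like the following sketch (layer sizes are arbitrary illustrative choices, not any particular published architecture):

```python
import torch
import torch.nn as nn

# A minimal funnel-shaped net for 3 x 32 x 32 images and K = 10 classes.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3 channels -> F_1 = 16 maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # F_1 -> F_2 = 32 maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),                                 # discard spatial layout
    nn.Linear(32 * 8 * 8, 128),                   # MLP classifier head
    nn.ReLU(),
    nn.Linear(128, 10),
)

scores = net(torch.randn(1, 3, 32, 32))
print(scores.shape)  # torch.Size([1, 10]), one score per class
```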
Here's where the 'fully-connected' comes in: at the point where the MLP gets its input from the final conv. layer, it no longer cares about the spatial arrangement of the feature maps. The spatial relationships are discarded, and each conv. layer output is connected to each neuron of the MLP.
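That all-to-all connectivity is easy to see from the weight shape of the fully-connected layer (again with made-up sizes):

```python
import torch.nn as nn

# Suppose the final conv layer outputs 32 feature maps of size 8 x 8.
# Flattening turns them into a single vector of 32 * 8 * 8 = 2048 values,
# and the Linear layer connects every value -- every spatial position of
# every map -- to every one of its (say) 128 neurons.
head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8 * 8, 128))
print(head[1].weight.shape)  # torch.Size([128, 2048]): all-to-all weights
```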

Fully-connected vs. Fully-convolutional
The distinction between the two is not entirely a new thing. It was used a while back in the context of multi-digit classification in an image. Instead of classifying a single-digit image, you sweep your network over an image with multiple digits, so that the individual digits of a larger number or zip code are presented to the model and classified individually and in sequence. The MNIST demos on Yann LeCun's website demonstrate the idea. Traversing an image with a conv. net is presented in a publication by Matan et al. at NIPS back in 1992, titled Multi-Digit Recognition Using a Space Displacement Neural Network.
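In a fully-convolutional net that sweep becomes implicit: replace the fully-connected head with an equivalent conv layer (as sketched above) and feed in a wider image, and you get one prediction per window position. A toy sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# A tiny fully-convolutional digit classifier (sizes are illustrative).
# Built for 1 x 28 x 28 digit crops, but its head is a conv, not a Linear,
# so it also runs on wider images.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 12
    nn.Conv2d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 12 -> 4
    nn.Conv2d(16, 10, kernel_size=4),  # "FC" head as a 4x4 conv -> 10 maps
)

digit = torch.randn(1, 1, 28, 28)
print(net(digit).shape)  # torch.Size([1, 10, 1, 1]): one prediction

strip = torch.randn(1, 1, 28, 140)  # a strip containing several digits
print(net(strip).shape)  # torch.Size([1, 10, 1, 29]): a prediction at
                         # each horizontal window position
```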

Thank you for the A2A
