Good question!
I’ve included a couple of additional papers in the references that formally establish the expressiveness of piecewise-linear / rectifier networks. Here is a less mathematically formal explanation for the casual reader:
You are correct that a ReLU is only piecewise linear, so one might suspect that, for a fixed-size network, a ReLU network would not be as expressive as one with a smoother, bounded activation function such as tanh.
Because they learn non-smooth functions, ReLU networks are better thought of as separating data in a piecewise-linear fashion than as “true” smooth function approximators. In machine learning, one is usually trying to learn from a finite set of discrete data points (e.g. 100K images), and in that setting it is sufficient to learn a separator for those data points.
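As a toy illustration (my own construction, not taken from the papers below; the names `relu` and `xor_net` and the weights are chosen purely for the example), here is a tiny hand-wired ReLU network that separates the four XOR points, something no single linear separator can do:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-chosen weights: for x, y in {0, 1},
#   xor(x, y) = relu(x + y) - 2 * relu(x + y - 1)
def xor_net(x, y):
    h1 = relu(x + y)        # first hidden unit
    h2 = relu(x + y - 1.0)  # second hidden unit
    return h1 - 2.0 * h2    # linear output layer

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), xor_net(x, y))  # prints 0.0, 1.0, 1.0, 0.0
```

The network is piecewise linear everywhere, yet it cleanly separates a dataset that no single linear function can.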
Consider the 2-dimensional modulus operator, i.e.
```glsl
vec2 p = vec2(x, y);      // x, y are floats
// Fold every point of 2D space into the unit square [0,1) x [0,1).
// GLSL's built-in mod() applies component-wise.
vec2 wrap(vec2 p) {
    return mod(p, 1.0);   // == vec2(mod(p.x, 1.0), mod(p.y, 1.0))
}
```
The output of this function is the result of folding/collapsing all of 2D space onto the unit square. It is piecewise linear, and yet highly nonlinear (because there are an infinite number of linear pieces).
Deep neural networks with ReLU activations work similarly: they partition/fold the input space into a bunch of different linear regions, like a really complex piece of origami.
See Figure 3 from “On the Number of Linear Regions of Deep Neural Networks” by Montúfar et al. [1].
In Figure 2, they illustrate how, as the depth (number of layers) of the network increases, the number of linear regions grows exponentially.
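To make the region-counting idea concrete, here is a rough sketch (my own, not code from the paper; `random_relu_net`, the weight initialization, and the grid sampling are all assumptions for illustration) that lower-bounds the number of linear regions a randomly initialized ReLU network carves out of the square [-1, 1]², by counting distinct on/off activation patterns over a grid of sample points. Within any one activation pattern, the network is a single affine function.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_relu_net(widths):
    """Random weights/biases for a fully connected ReLU net with the given layer widths."""
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def activation_pattern(layers, x):
    """Record which ReLUs fire (1) or stay off (0) at input x."""
    pattern, h = [], x
    for W, b in layers:
        z = W @ h + b
        pattern.append(tuple((z > 0).astype(int)))
        h = np.maximum(0.0, z)
    return tuple(pattern)

def count_regions(layers, grid=200):
    """Count distinct activation patterns over a grid on [-1, 1]^2
    (a lower bound on the true number of linear regions)."""
    xs = np.linspace(-1, 1, grid)
    patterns = {activation_pattern(layers, np.array([x, y])) for x in xs for y in xs}
    return len(patterns)

# Same total number of hidden units (8), arranged shallow vs. deep:
shallow = random_relu_net([2, 8])      # one hidden layer of 8 units
deep    = random_relu_net([2, 4, 4])   # two hidden layers of 4 units
print("shallow:", count_regions(shallow))
print("deep:   ", count_regions(deep))
```

Note that grid sampling only gives a lower bound, and with random weights the two counts vary from run to run; the exponential-growth statement in [1] is about the maximum number of regions attainable as depth increases.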
It turns out that, with enough layers (and hence enough linear pieces), you can approximate “smoothness” to any desired degree. Furthermore, if you add a smooth activation function such as a sigmoid at the last layer, the output is squashed smoothly and is no longer piecewise linear (though the kinks inherited from the ReLU layers remain).
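To see the first claim in action, here is a minimal sketch (my own, using a standard piecewise-linear interpolation construction; `relu_interpolant` and the knot placement are assumptions for illustration, not anything from the referenced papers) of a one-hidden-layer ReLU network wired to interpolate sin(x), with the error shrinking as more units/pieces are added:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_interpolant(f, knots):
    """One-hidden-layer ReLU net that reproduces the piecewise-linear
    interpolant of f at the given knots (valid for x >= knots[0])."""
    y = f(knots)
    slopes = np.diff(y) / np.diff(knots)
    # First unit carries the initial slope; each later unit adds the
    # change in slope at its knot.
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))
    def net(x):
        x = np.asarray(x, dtype=float)
        hidden = relu(x[..., None] - knots[:-1])  # one ReLU unit per knot (except the last)
        return y[0] + hidden @ coeffs
    return net

xs = np.linspace(0.0, 2.0 * np.pi, 1000)
for n_units in (4, 16, 64):
    net = relu_interpolant(np.sin, np.linspace(0.0, 2.0 * np.pi, n_units + 1))
    print(n_units, "units -> max error:", np.abs(net(xs) - np.sin(xs)).max())
```

With 4, 16, and 64 hidden units the maximum error shrinks roughly like 1/n², the usual behaviour of piecewise-linear interpolation of a smooth function.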
Generally, we do not actually want a function approximator that is flexible enough to exactly match every data point, since it would overfit the dataset instead of learning a generalizable representation that works well on the test set. By learning separators we get better generalizability, so ReLU networks are a little bit better at self-regularizing in that sense.
References
[1] Montúfar et al., “On the Number of Linear Regions of Deep Neural Networks,” NIPS 2014. https://papers.nips.cc/paper/5422-on-the-number-of-linear-regions-of-deep-neural-networks.pdf
[2] “Expressiveness of Rectifier Networks,” arXiv:1511.05678. https://arxiv.org/abs/1511.05678
[3] “A Comparison of the Computational Power of Sigmoid and Boolean Threshold Circuits.” https://pdfs.semanticscholar.org/1776/a4c69e203c5ae2eeb5ba8bed26c7464f7543.pdf