The story of how ML was created lies in the answer to this apparently simple and direct question. In research, it is often the simplest questions that lead to the most profound answers.
So, the story of ML begins in the late 1950s, when a psychologist named Frank Rosenblatt invented a computational model of the brain that he called the “perceptron”.
I’m going to quote the Wikipedia description of the perceptron’s announcement verbatim, because in a truly “déjà vu” way, what’s happening in 2019 eerily resembles what happened 60 years ago. So, here is the Wikipedia quote:
“The perceptron algorithm was invented in 1958 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, funded by the United States Office of Naval Research.
The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the "Mark 1 perceptron". This machine was designed for image recognition: it had an array of 400 photocells, randomly connected to the "neurons". Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.”
Now, get ready for the creepy part. Again, citing Wikipedia:
“In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling AI community; based on Rosenblatt's statements, The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
Doesn’t this irrational exuberance from 60 years ago eerily resemble the current media frenzy about AI? So, what happened next in our story of the humble perceptron? Well, one of the founders of AI, Professor Marvin Minsky of MIT, and his colleague Seymour Papert published a dazzling takedown of the perceptron in their classic 1969 book Perceptrons, which remains one of the genuine cornerstones of computational learning theory.
Minsky and Papert showed that, far from being the machine of the media frenzy that could “walk, talk, see, write and reproduce itself”, the perceptron was severely limited in its ability to learn functions from data. In their enthusiasm, researchers had forgotten to ask a more basic question: are there limits to its power to learn? In particular, perceptrons could not learn simple functions like XOR, because such functions are not “linearly separable”.
As the figure shows, the AND and OR Boolean functions are learnable by a perceptron because their positive and negative examples can be separated by a line. Not so with XOR. So, the bubble burst, and work on neural networks came crashing to a halt. Minsky later said that killing the field was never his intention; he had only meant to expose the flaws of the original model.
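If you want to see the limitation for yourself, here is a minimal sketch (my own toy code, with an arbitrary learning rate and epoch count, not anything resembling Rosenblatt’s machine): the classic perceptron learning rule reaches perfect accuracy on AND and OR, but no amount of training gets it to 100% on XOR.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    """Rosenblatt-style update: nudge the weights whenever a prediction is wrong."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            pred = 1 if np.dot(w, x) + b > 0 else 0
            w += lr * (t - pred) * x
            b += lr * (t - pred)
    return w, b

def accuracy(w, b, X, y):
    preds = np.array([1 if np.dot(w, x) + b > 0 else 0 for x in X])
    return np.mean(preds == y)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "OR":  np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),  # not linearly separable
}

for name, y in targets.items():
    w, b = train_perceptron(X, y)
    print(f"{name}: training accuracy = {accuracy(w, b, X, y):.2f}")
# AND and OR reach 1.00; XOR never does, exactly as Minsky and Papert proved.
```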
The next advance came around 30 years later, when multilayer feedforward networks were invented and the famous backpropagation algorithm was published by Geoff Hinton and colleagues in the mid-1980s. I took Hinton’s neural network course at CMU around 1987, and I remember the excitement. It was very similar to the feeling you get today.
The essence of the algorithm is computing the gradient of the error at the output with respect to parameters that lie in the interior layers. The algorithm requires only basic calculus to understand: it is essentially an application of the chain rule. As it turned out, multilayer neural networks were freed from the limitations of perceptrons. Using one of the cornerstone theorems of functional analysis, the Hahn-Banach theorem, a Dartmouth mathematician proved that multilayer neural networks can indeed represent any continuous function. The original proof requires some deep math, but simpler presentations exist.
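To make the chain-rule idea concrete, here is a bare-bones sketch of backpropagation for a one-hidden-layer network trained on XOR, the very function the perceptron could not handle. The layer sizes, sigmoid activations, squared-error loss, and learning rate are arbitrary choices of mine for illustration, not anything taken from the original papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data: the function a single-layer perceptron cannot learn
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
lr = 1.0

for step in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: the chain rule applied one layer at a time
    d_out = (out - y) * out * (1 - out)     # through the output sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)      # through the hidden sigmoid

    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 2))  # with enough steps this typically approaches 0, 1, 1, 0
```

Every line of the backward pass is just the chain rule pushing the output error one layer deeper; that is all there is to backpropagation.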
Neural networks and deep learning
So, 30 years later, does our story have a happy ending? Far from it. We are back to facing the same issues. Like the famous movie Groundhog Day, the same scenario seems to keep playing endlessly in ML and AI. The problem is that these theorems about multilayer neural networks say nothing about the ability of backpropagation to “learn” any continuous function from data. All they prove is that there exists a set of weights with which a multilayer neural network can “represent” any such function.
Remarkably, we are almost in 2020, and the situation has not changed. There is still no proof that multilayer neural networks can be trained to learn any smooth function from data. Meanwhile, the situation is getting perilous, because these multilayer networks are being used in millions of real-world, life-or-death applications.
For example, my Tesla Model S P100D uses multilayer networks to implement a simple form of autonomous driving. It would be good to know I can trust my car to take me safely to work!
So, there is now a glimmer of hope, but the picture remains murky. A number of recent theoretical papers have shown that infinitely wide neural networks reduce to a fairly simple Gaussian-process-like model whose training dynamics can be studied in a tractable way. Without getting too technical: a certain positive definite matrix can be constructed whose spectral properties (meaning its eigenvalues) change very little between random initialization and the final trained values. This matrix is called the neural tangent kernel.
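For the curious, here is a rough sketch of that matrix for one small network; the width, the inputs, and the tanh nonlinearity are arbitrary choices of mine, and the actual theory concerns the infinite-width limit, which no finite example can fully capture. Entry (i, j) of the kernel is the inner product of the parameter gradients of the network output at inputs x_i and x_j.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 3, 512, 5                         # input dim, hidden width, number of inputs
X = rng.normal(size=(n, d))

W1 = rng.normal(size=(d, width)) / np.sqrt(d)
W2 = rng.normal(size=(width, 1)) / np.sqrt(width)

def grad_of_output(x):
    """Gradient of the scalar output f(x) = W2.T @ tanh(W1.T @ x) w.r.t. (W1, W2)."""
    pre = x @ W1                                # pre-activations, shape (width,)
    h = np.tanh(pre)
    dW2 = h                                     # df/dW2
    dW1 = np.outer(x, (1 - h ** 2) * W2[:, 0])  # df/dW1 via the chain rule
    return np.concatenate([dW1.ravel(), dW2.ravel()])

grads = np.stack([grad_of_output(x) for x in X])
K = grads @ grads.T                             # empirical neural tangent kernel, n x n
print(np.round(np.linalg.eigvalsh(K), 2))       # its eigenvalues (all non-negative)
```

The NTK papers show that as the width grows, this matrix at initialization converges to a deterministic kernel and stays essentially fixed during training, which is what makes the training dynamics tractable.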
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
There’s also a more confusing picture emerging: massively overparameterized neural networks contain “lottery tickets”, that is, really small subnetworks that work almost as well as the full network.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
In other words, you can throw away 90% of the weights and still get an effective network! What’s going on? I’ll let you in on an insider secret: no one knows! It’s a mystery. Scary but true. We are reliving the perceptron nightmare. The abstract of this now-famous 2018 paper reads:
“Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations.”
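To give a feel for the procedure the abstract describes, here is a compressed sketch: train a dense network, keep only the largest-magnitude weights, rewind the survivors to their initial values, and retrain the sparse “ticket”. The tiny network, the synthetic data, and the one-shot (rather than iterative) pruning are my simplifications, not the paper’s actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(float)[:, None]   # a simple nonlinear target

init = {"W1": 0.5 * rng.normal(size=(10, 64)), "W2": 0.5 * rng.normal(size=(64, 1))}

def train(params, mask, steps=5000, lr=2.0):
    """Gradient descent on squared error; pruned weights are held at zero by the mask."""
    W1, W2 = params["W1"] * mask["W1"], params["W2"] * mask["W2"]
    for _ in range(steps):
        h = np.tanh(X @ W1)
        out = 1 / (1 + np.exp(-(h @ W2)))
        d_out = (out - y) * out * (1 - out) / len(X)
        d_h = (d_out @ W2.T) * (1 - h ** 2)
        W2 -= lr * (h.T @ d_out) * mask["W2"]
        W1 -= lr * (X.T @ d_h) * mask["W1"]
    acc = np.mean((out > 0.5) == (y > 0.5))
    return {"W1": W1, "W2": W2}, acc

dense_mask = {k: np.ones_like(v) for k, v in init.items()}
trained, dense_acc = train(init, dense_mask)

# Keep only the largest 10% of trained weights; rewind survivors to their initial values.
mask = {k: (np.abs(v) >= np.quantile(np.abs(v), 0.90)).astype(float)
        for k, v in trained.items()}
_, ticket_acc = train(init, mask)

print(f"dense training accuracy:  {dense_acc:.2f}")
print(f"ticket training accuracy: {ticket_acc:.2f}")   # often surprisingly close
```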
So, 60 years after the announcement of the perceptron, we are still just as unsure about the true power of neural networks. But there’s a trillion-dollar industry riding on the outcome. This is better than any Hollywood thriller. Stay tuned!