No.
They are after the same thing (the derivative of a function), but there’s a big difference in how they go about it.
Consider a function:
[math]f(x) = \frac{1}{x^2 + 1} [/math]
We are interested in finding [math]f'(x) [/math]: how f(x) changes with respect to x.
Numerical Differentiation uses this approximation:
[math]f'(x_{0}) \approx \frac{f(x_{0}+h)-f(x_{0})}{h}[/math]
Basically, plug in the input [math]x_{0}[/math] and note the value. Then, change the input slightly by plugging in [math]x_{0} + h[/math] (where [math]h[/math] is a small increment) to see how the output changes.
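For instance, at [math]x_{0} = 2[/math] with [math]h = 0.001[/math]:
[math]\frac{f(2.001)-f(2)}{0.001} \approx \frac{0.1998401 - 0.2}{0.001} = -0.1599[/math]
whereas (as we’ll see below) the exact value is [math]-0.16[/math].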
Automatic Differentiation, on the other hand, goes about it in steps, making use of
i) the fact that a function can be thought of as a computational graph made up of “primitive” operations, and
ii) the chain rule of differentiation.
So, say we break [math]f(x)[/math] into two operations [1]:
[math]y = x^2 + 1[/math]
[math]z = 1 / y[/math]
What we can do is take each operation separately.
For y,
[math]\frac{\partial y}{\partial x} = 2 * x[/math]
And for z,
[math]\frac{\partial z}{\partial y} = - \frac{1}{y^2}[/math]
And finally, using the chain rule:
[math]\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} * \frac{\partial y}{\partial x}[/math]
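Multiplying the two partials and substituting [math]y = x^2 + 1[/math] back in:
[math]\frac{\partial z}{\partial x} = -\frac{1}{(x^2+1)^2} \cdot 2x = -\frac{2x}{(x^2+1)^2}[/math]
which at [math]x = 2[/math] gives [math]-\frac{4}{25} = -0.16[/math], the exact value the numerical estimate above was approximating.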
Notice we can calculate the partials for each node as we’re evaluating that node.
This means we can do something like this:
- From x, calculate y and also dy/dx
- From y, calculate z and also dz/dy
- Calculate dz/dx by accumulating the partials “along the way” from z to x.
This is an important capability as we’ll see.
Anyway, the above example can be translated into code like this:
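(A minimal Python sketch; the num_diff and auto_diff names match the note below, but the exact bodies here are illustrative.)

```python
def f(x):
    return 1.0 / (x ** 2 + 1.0)

def num_diff(x, h=1e-5):
    # Forward-difference approximation: (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h

def auto_diff(x):
    # Forward pass: evaluate each primitive op and its local partial.
    y = x ** 2 + 1.0           # y = x^2 + 1
    dy_dx = 2.0 * x            # dy/dx = 2x

    z = 1.0 / y                # z = 1/y
    dz_dy = -1.0 / (y ** 2)    # dz/dy = -1/y^2

    # Chain rule: accumulate the partials "along the way".
    dz_dx = dz_dy * dy_dx
    return z, dz_dx

x0 = 2.0
print("num_diff :", num_diff(x0))       # ≈ -0.1599991 (an approximation)
print("auto_diff:", auto_diff(x0)[1])   # -0.16 (exact up to float precision)
```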
Note: The auto_diff method gives a more precise result than num_diff in the above example. This is usually the case: the numerical method relies on an approximation that, in theory, gets better as we take smaller values of h, but very small values of h make the computation prone to floating-point rounding errors. [2]
Why Automatic Differentiation over Numerical Differentiation in Deep Learning?
Firstly, Numerical Differentiation is impractical for even “Shallow” learning. Say you have a small network like this:
Here, the function is [math]f_W (A, B) = O_1[/math].
You want to figure out how the output ([math]O_1[/math]) [3] changes when we change the weights (the Ws). How would we do that with numerical differentiation?
Well, for each W, we’d have to calculate [math]f_W[/math] after tweaking that W by a small amount. So, we’d be evaluating [math]f_W[/math] 14 times above (once for each of the 13 weights and once for the base case). And for all that effort, we’d end up with somewhat imprecise values of the partials.
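To see the cost concretely, here’s a rough sketch of a finite-difference gradient over a weight vector (the f and the 13 weights here are hypothetical stand-ins for the pictured [math]f_W[/math], not its actual definition): it needs one extra forward evaluation per weight, on top of the baseline one.

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    # One baseline evaluation, plus one extra forward pass per weight.
    base = f(W)
    grad = np.zeros_like(W)
    for i in range(W.size):
        W_plus = W.copy()
        W_plus[i] += h
        grad[i] = (f(W_plus) - base) / h
    return grad

# Hypothetical stand-in for f_W with 13 weights: 1 + 13 = 14 evaluations of f.
f = lambda W: float(np.tanh(W).sum())
W = np.random.randn(13)
print(numerical_gradient(f, W))
```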
Instead, what’s usually done is that [math]f_W[/math] is evaluated once (the “forward” pass) and then the partials/gradients [4] are accumulated in the “backward” pass. This works because, as seen previously, each node can figure out its “local” gradient [5] during the forward pass, and then use that knowledge to pass the gradient along in the backward pass.
That’s where Automatic Differentiation can come in to make things easy.
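As a tiny illustration of the forward-then-backward idea (a hypothetical single neuron, not the pictured network): the forward pass computes the output and remembers each node’s local partial, and the backward pass multiplies them back to get the partials with respect to every weight in one go.

```python
import math

def neuron_with_grads(a, b, w1, w2, bias):
    # Forward pass: compute values and note the "local" partials.
    s = w1 * a + w2 * b + bias        # ds/dw1 = a, ds/dw2 = b, ds/dbias = 1
    o = 1.0 / (1.0 + math.exp(-s))    # sigmoid; do/ds = o * (1 - o)

    # Backward pass: accumulate gradients from the output back to the weights.
    do_ds = o * (1.0 - o)
    grads = {"w1": do_ds * a, "w2": do_ds * b, "bias": do_ds * 1.0}
    return o, grads

o, grads = neuron_with_grads(a=0.5, b=-1.0, w1=0.3, w2=0.8, bias=0.1)
print(o, grads)   # one forward + one backward pass gives all three partials
```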
So, Deep Learning libraries like TensorFlow specify primitive operations that can be combined to form a variety of (usually pretty complex) computational graphs.
Essentially, they are saying: “Use these operations as the building blocks of your computation, and we’ll handle the nitty-gritty of finding the partials, etc.”
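For example, with TensorFlow 2.x’s GradientTape (assuming the eager-execution API), the [math]f(x) = \frac{1}{x^2 + 1}[/math] example from above becomes:

```python
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    # Built from TensorFlow's primitive ops, each with a registered gradient.
    z = 1.0 / (x ** 2 + 1.0)

print(tape.gradient(z, x).numpy())  # -0.16
```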
Footnotes
[1] The two operations here are pretty arbitrary, but in practice the standard basic operations (add, subtract, etc.) and widely used ones (like activation functions) are chosen as the “primitive” building blocks. You can see the registered gradients here.
[2] Numerical Differentiation can be useful for Gradient Checking though.
[3] Actually, we’re interested in how the Cost/Loss function changes with respect to the parameters, and the loss in turn depends on the output. So, for simplicity, let’s just say the output.
[4] I couldn’t decide whether to use “partials” (a simpler term) or “gradients” (perhaps, better suited in context).
[5] I tried not to bring backpropagation into the mix, but I guess that was a failed attempt. Anyway, an understanding of backpropagation (and of viewing the net as a computational graph) can help a lot in seeing the bigger picture.