Dmitriy Genzel

This is a great question. I just recently talked with someone who didn’t understand what was going on, and unfortunately there was no time for me to go into it, so I’ll try to do this now.

The person I talked to said something like this:

LSTMs have memory, and plain RNNs don’t. This means that LSTMs can remember something, and RNNs can’t.

This is either very wrong, or basically right, depending on your point of view.

Let me first explain why this is very wrong. If you actually try to implement RNNs and LSTMs, you realize that the way they work is by unrolling them in time. This lets you forget about “memory” or “state”, and it just becomes a standard feed-forward network.
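
To see what "unrolling in time" means in practice, here is a minimal sketch of a vanilla (Elman-style) RNN; numpy, the tanh activation, and the absence of biases are all just illustrative assumptions, not a fixed recipe:

```python
import numpy as np

def rnn_unrolled(xs, U, W, s0):
    """Unroll a vanilla RNN over an input sequence xs.

    Each time step is just an ordinary feed-forward layer that takes the
    previous step's state as an extra input; no special "memory"
    machinery is involved.
    """
    s, states = s0, []
    for x in xs:
        s = np.tanh(U @ x + W @ s)   # one feed-forward layer per time step
        states.append(s)
    return states
```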

(Picture from: Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs)

In this picture there are transformations U, V, and W, and depending on the type of network some of them may or may not exist. In particular, which one is missing for classical RNNs? If you answered V, you are correct. An RNN computes its output and passes that same output to the next time step; in other words, it has [math]s_t = o_t[/math]. Some people say that classical RNNs have no state, but this is obviously wrong: they do have state, and it is exactly what gets passed to the next time step. Before LSTMs appeared, people mostly thought of state and output as the same thing, and they tended to call this object the output at time t rather than the state at time t. Notice also that the V part sits outside the network itself; it's not recurrent at all. That means a traditional RNN can model basically the same things an LSTM can, if you can supply the correct parameters (more on that later).
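
To see why V is "outside" the recurrence, here is a sketch in the same illustrative conventions as above, where the readout is computed from each state but never feeds back into the next step; dropping V entirely leaves the recurrence untouched:

```python
import numpy as np

def rnn_with_readout(xs, U, W, V, s0):
    s, outputs = s0, []
    for x in xs:
        s = np.tanh(U @ x + W @ s)   # recurrent part: s_t depends on s_{t-1}
        o = V @ s                    # readout: uses s_t but never feeds back
        outputs.append(o)
    return outputs                   # without V, the output is simply s_t itself
```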

When LSTMs came on the scene, the state went from a single item to two: one we can refer to as the “cell” and the other as the “output”.

(From highly recommended Understanding LSTM Networks)

In this way of looking at it, the LSTM obviously didn't add any “memory”; it just passes two signals to the next step instead of one. But what it did is amazing nonetheless: if you look at the top line (the cell), you can see how easy it is to end up with [math]c_t = c_{t-1}[/math] (for the whole vector, or for any subset of dimensions within [math]c_t[/math]). You can think of that as remembering something over time. However, let me emphasize again: even in a traditional RNN you can set things up so that the “state” (or a subset of it) is preserved over time, just like with an LSTM.
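
Here is a rough sketch of a single LSTM step to make the [math]c_t = c_{t-1}[/math] point concrete; the weight names (Wf, Wi, Wo, Wc) are my own, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc):
    """One LSTM step on the concatenated [h_prev, x_t] vector (biases omitted).

    The cell state follows a near-linear path: if the forget gate saturates
    near 1 and the input gate near 0, then c_t ~= c_{t-1}, i.e. the cell
    simply carries its value forward unchanged ("remembering").
    """
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)               # forget gate
    i = sigmoid(Wi @ z)               # input gate
    o = sigmoid(Wo @ z)               # output gate
    c_tilde = np.tanh(Wc @ z)         # candidate cell values
    c_t = f * c_prev + i * c_tilde    # f ~ 1, i ~ 0  =>  c_t ~ c_prev
    h_t = o * np.tanh(c_t)            # the "output" (hidden state)
    return h_t, c_t
```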

In case you think that having that special “memory”-like thing is really fundamental, let me show you a GRU:

Here there’s exactly one state, just like in a classical RNN, but the transformation is much more complicated. Does this have “memory”? Well, yes and no. Its output is its state, as in an RNN, and there’s nothing extra; but unlike in an RNN, it is easy to preserve the value, so its output is memory-like. GRUs are very effective in many situations and can often model memory-like phenomena just as well as LSTMs.
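
For comparison, here is a sketch of a single GRU step (again with illustrative weight names and biases omitted). There is only one state h, which doubles as the output, but the update gate makes it trivial to leave it unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step. A single state h, as in a classical RNN, but when the
    update gate z ~= 0 we get h_t ~= h_{t-1}: the state is preserved."""
    v = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ v)                                 # update gate
    r = sigmoid(Wr @ v)                                 # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    h_t = (1.0 - z) * h_prev + z * h_tilde              # z ~ 0  =>  state kept
    return h_t
```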

So the person I spoke to was completely wrong. However, he was also right. Why? RNNs can model memory-like phenomena if you supply the right weights. But the point of ML is, of course, whether the network can learn those weights, and in practice that turns out not to work well. It is quite hard for a classical RNN to learn to preserve a value in its state, partly because of the vanishing gradient problem, and partly because the activation function applied at every step tends to mess things up. But often, in order to model your problem, you want memory-like behavior, and if you use LSTMs or GRUs you get this behavior naturally.
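
A tiny numerical illustration of the point about the activation function: even with a perfect identity-like recurrent weight, the tanh applied at every step of a plain RNN slowly erodes a stored value, whereas the LSTM's cell path (forget gate near 1, input gate near 0) leaves it intact. The numbers below are for this toy setup only:

```python
import numpy as np

# Plain RNN: try to carry the value 1.0 through 50 steps with an identity
# recurrent weight and no input. The tanh squashing alone erodes it.
s = 1.0
for _ in range(50):
    s = np.tanh(s)
print(s)          # ~0.17: most of the "remembered" value has leaked away

# LSTM cell path: with forget gate ~1 and input gate ~0 the update is
# c_t = 1.0 * c_{t-1} + 0.0, so the value survives untouched.
c = 1.0
for _ in range(50):
    c = 1.0 * c + 0.0
print(c)          # still 1.0
```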

In other words, if you want your recurrent network to behave as if it can remember things, you should use LSTMs or GRUs, not traditional RNNs, even though theoretically the latter are just as capable. None of these have “memory” (or all of them do), but the former two can easily preserve a state as it passes through a step, which we can think of as remembering, while the latter struggles to do so.
