
Long Short-Term Memory

Long short-term memory (LSTM) is a type of recurrent neural network (RNN), first described in a 1997 paper. Thanks to its distinctive design, LSTM is well suited to processing and predicting events separated by very long intervals and delays in a time series.

LSTM usually outperforms plain recurrent networks and hidden Markov models (HMMs), for example in unsegmented, continuous handwriting recognition. In 2009, an artificial neural network built with LSTM won the ICDAR handwriting recognition competition. LSTM is also widely used in speech recognition; in 2013 it achieved a record 17.7% phoneme error rate on the TIMIT natural speech dataset. As a nonlinear model, LSTM can serve as a complex nonlinear unit for building larger deep neural networks.

Understanding LSTM Networks

Recurrent Neural Networks

Humans don't think from scratch all the time. If you're reading this article, you're understanding each word based on the previous words, and you don't need to throw everything away and start thinking from scratch. Your thinking has continuity.

Traditional neural networks can't do this, and that is a major shortcoming. For example, imagine you want to classify what kind of event is happening at each point in a movie. It's unclear how a traditional neural network could use what it saw earlier in the film to reason about later events.

Recurrent neural networks are used to solve this problem. There are loops inside recurrent neural networks to maintain the continuity of information.

In the figure above, a chunk of neural network, A, takes an input x_t and produces an output h_t. A loop lets information be passed from one step of the network to the next.

These loops make recurrent neural networks seem hard to understand. However, if you think about it for a moment, they are not all that different from ordinary neural networks. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to its successor. If you unroll the loop, you will see:

This chain structure naturally reveals that recurrent neural networks are closely related to sequences and lists. This is a natural architecture for neural networks to process sequence data.
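To make the unrolled picture concrete, here is a minimal Python sketch (the names `step`, `xs`, and `h0` are illustrative, not from the original post): the same module is applied at every position of the sequence, passing its state forward.

```python
def unroll(step, xs, h0):
    """Apply the same repeating module at every time step.

    step: a function (x_t, h_prev) -> h_t, the repeating module
    xs:   the input sequence
    h0:   the initial hidden state
    """
    h = h0
    outputs = []
    for x_t in xs:
        h = step(x_t, h)   # the loop: information is passed to the next copy
        outputs.append(h)
    return outputs
```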

And they certainly are used! In recent years, RNNs have achieved incredible success in speech recognition, language modeling, translation, image captioning, and many other areas. I will leave the discussion of RNN achievements to Andrej Karpathy's blog. RNNs really are amazing!

The key to these successes is the “LSTM” — a special type of recurrent neural network that performs much better than the standard RNN on many problems. Almost all of the great results achieved with recurrent neural networks are due to the use of LSTM. This post is about LSTM.

Long-term dependency issues

One of the appeals of RNNs is their ability to connect previous information to the current task, such as using earlier video frames to inform the understanding of the current frame. If RNNs could do this, they would be extremely useful. But can they? Well, it depends.

Sometimes we only need recent information to perform the current task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in "the clouds are in the sky", we don't need any further context; it's obvious that the next word will be "sky". In cases like this, where the gap between the relevant information and the place it is needed is small, an RNN can learn to use past information.

But there are also cases where we need more context. Consider trying to predict the last word in "I grew up in France… I speak fluent French." The most recent information suggests that the next word is probably the name of a language, but to narrow down which language, we need the context of France from much further back. So it is entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as the distance increases, the RNN starts to fail to connect the information.

In theory, RNNs are absolutely capable of handling such "long-term dependencies". A human could carefully pick parameters for them to solve toy problems of this form. Unfortunately, in practice, RNNs do not seem to be able to learn them. The problem was explored in depth by Hochreiter and by Bengio et al., who found fundamental reasons why it is difficult.

Thankfully, LSTMs don’t have this problem!

LSTM Network

Long Short-Term Memory Networks — often referred to as LSTMs, are a special type of RNN that is able to learn long-term dependencies. They were proposed by Hochreiter and Schmidhuber (1997) and improved and generalized by many others in subsequent work. LSTMs perform very well on a wide variety of problems and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.
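As a sketch, the repeating module of a standard RNN can be written in a few lines of Python with NumPy (the weight names `W` and `b` are illustrative): the previous hidden state and the current input pass through a single tanh layer.

```python
import numpy as np

def vanilla_rnn_step(x_t, h_prev, W, b):
    # The standard RNN repeating module: a single tanh layer applied to
    # the concatenation of the previous hidden state and the current input.
    z = np.concatenate([h_prev, x_t])
    return np.tanh(W @ z + b)
```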

LSTMs have this similar chain-like structure, but the repeating modules have a different structure. Instead of a single neural network layer, there are four, and they interact in a very specific way.

Don't worry about the details. We'll walk through a diagram of an LSTM later. For now, let's try to get familiar with the notation we'll be using.

In the diagram above, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent point-wise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking means its contents are copied and the copies go to different locations.

The core idea of LSTM

The key to LSTM is the cell state, which is the upper horizontal line in the figure.

The cell state is a bit like a conveyor belt. It runs through the entire chain with only some minor linear interactions. Information flows easily through it in an unchanged manner.

The LSTM can add information to or remove information from the cell state, carefully regulated by structures called "gates".

Gates can selectively let information through. They consist of sigmoid neural network layers and point-wise multiplication operations.

The sigmoid layer outputs numbers between 0 and 1, describing how much of each component should be let through. A value of 0 means "let nothing through", and a value of 1 means "let everything through".

An LSTM has three such gates used to maintain and control the cell state.
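A gate can be sketched as follows (the parameter names `W_g` and `b_g` are illustrative): a sigmoid layer produces a number between 0 and 1 for each component, and a point-wise multiplication then scales the vector being controlled.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(h_prev, x_t, controlled, W_g, b_g):
    # Sigmoid layer: one value in (0, 1) per component of `controlled`.
    g = sigmoid(W_g @ np.concatenate([h_prev, x_t]) + b_g)
    # Point-wise multiplication: 0 lets nothing through, 1 lets everything through.
    return g * controlled
```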

Step-by-step analysis of the LSTM process

The first step of the LSTM is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". It looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each entry in the cell state C_{t-1}. A 1 means "completely keep this" and a 0 means "completely get rid of this".
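In the standard formulation, the forget gate is written as follows, where \sigma is the sigmoid function and [h_{t-1}, x_t] denotes the concatenation of the previous output and the current input:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)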

Let's go back to the example of a language model trying to predict the next word based on all the previous ones. In this problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values to update. Second, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step we combine the two to create an update to the state.
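In the standard formulation, these two parts are written as:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)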

In our language model, we want to add the gender of the new subject to the cell state, to replace the old one that we are forgetting.

It is now time to update the old cell state C_{t-1} to the new cell state C_t. The previous steps have already decided what to do; we just need to actually do it.

We multiply the old state by f_t, forgetting the things we decided to forget. Then we add i_t * C̃_t, the new candidate values scaled by how much we decided to update each state value.
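In equation form (with * denoting element-wise multiplication):

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t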

In the case of the language model, this is where we actually drop the information about the old subject's gender and add the new information, as decided in the previous steps.

Finally, we need to decide what to output. The output is based on our cell state, but is a "filtered" version of it. First, we run a sigmoid layer that decides which parts of the cell state to output. Then we put the cell state through tanh (scaling the values to between -1 and 1) and multiply it by the output of the sigmoid layer, so that we only output the parts we want.
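In the standard formulation, the output step is written as:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t * \tanh(C_t)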

Taking the language model as an example: once a subject appears, that information will affect the verb that comes later. For instance, knowing whether the subject is singular or plural tells us which form a following verb should take.
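Putting the four steps together, a single LSTM step can be sketched in Python with NumPy as follows. This is a minimal illustration of the standard cell described above; the parameter names are illustrative, and real implementations fuse the matrix multiplications and add many optimizations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to throw away
    i_t = sigmoid(W_i @ z + b_i)          # input gate: what to update
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate values for the state
    C_t = f_t * C_prev + i_t * C_tilde    # new cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate: what to reveal
    h_t = o_t * np.tanh(C_t)              # filtered output
    return h_t, C_t
```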

Variations of long short-term memory

So far I have described a fairly standard LSTM network. But not all LSTMs are the same as the one above. In fact, almost every paper involving LSTMs uses a slightly different version. The differences are minor, but some of the variants are worth knowing about.

A popular LSTM variant, introduced by Gers and Schmidhuber, adds "peephole connections" to the LSTM, meaning that the gate layers are allowed to look at the cell state.

In the image above, peepholes are added to all the gates, but many papers give only some of the gates peepholes.
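With peepholes on every gate, the gate equations become (the cell state is simply added to each gate's inputs):

f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)
o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)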

Another variant uses coupled forget and input gates. Instead of deciding separately what to forget and what new information to add, we make those decisions together: we only forget when we are going to input something in its place, and we only input new values into the state when we forget something older.
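In equation form, the coupled variant replaces the separate input gate with 1 - f_t, so the state update becomes:

C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t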

A more dramatic variation on the LSTM is the Gated Recurrent Unit (GRU), proposed by Cho et al. The GRU combines the forget and input gates into a single "update gate", merges the cell state and the hidden state, and makes some other changes. The resulting model is simpler than a standard LSTM and has been growing increasingly popular.
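The resulting GRU update is usually written as follows, where z_t is the update gate, r_t is the reset gate, and h̃_t is the candidate hidden state:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t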

These are just a few of the well-known LSTM variants. There are other variants, such as the Depth Gated RNN proposed by Yao et al. There are also completely different approaches to long-term dependencies, such as the Clockwork RNN proposed by Koutnik et al.

Which of these variants is the best? Do the differences between them matter? Greff et al. did a study that carefully compared popular variants and found that they were almost the same. Jozefowicz et al. tested more than 10,000 RNN architectures and found that some architectures performed better than LSTM on specific problems.

Conclusion

Earlier, I noted that people are achieving remarkable results with RNNs, and essentially all of those results are achieved with LSTM networks. For most problems, LSTMs simply work better!

Written down as a set of equations, LSTMs look daunting. Hopefully, walking through them step by step in this article has made them more approachable.

LSTM was a big step forward for RNNs. It is natural to ask: is there room for another big step? A common answer among researchers is: yes, and that step is attention! The idea of attention is to let every step of an RNN pick out information from some larger collection of information. For example, if you use an RNN to generate a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu et al. did exactly this, and their work is a fun starting point if you want to explore attention. There have been many other exciting results using attention, and it seems attention will become even more powerful...

Attention is not the only exciting thread in RNN research. Grid LSTMs by Kalchbrenner et al. look extremely promising. The work of Gregor et al., Chung et al., and Bayer and Osendorfer on using RNNs in generative models is also very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to be even more so.

Reprinted from https://www.cnblogs.com/xuruilong100/p/8506949.html