What is Long Short-Term Memory (LSTM): Artificial Intelligence Explained





Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, an LSTM has feedback connections, which in principle make it as computationally general as a Turing machine and allow it to process not only single data points (such as images) but entire sequences of data (such as speech or video).

Its name reflects the main characteristic of the architecture: it maintains a short-term memory (its internal state) that can persist over long stretches of time, retaining relevant information and forgetting what is irrelevant. This makes LSTM particularly effective for tasks that involve sequential data with long-term dependencies, such as time series prediction, natural language processing, and speech recognition.

History of LSTM

The LSTM model was first introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997. The two researchers were trying to solve the problem of long-term dependencies, a common issue in training traditional RNNs. When the gap between the relevant information and the point where it is needed becomes too large, RNNs become unable to learn to connect the information.

In their groundbreaking paper “Long Short-Term Memory”, Hochreiter and Schmidhuber proposed a solution to this problem. They introduced a new type of RNN that could learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through “constant error carrousels” (CECs) within special units, called cells.

Evolution of LSTM

Since its introduction, LSTM has undergone several modifications and improvements. In 2000, Felix Gers and his colleagues proposed adding a “forget gate” to the model, allowing the LSTM cell to learn when to forget its stored state. This was a significant improvement as it allowed the model to decide how long to remember information.

In 2014, Kyunghyun Cho and his colleagues proposed a simplified version of LSTM, called Gated Recurrent Unit (GRU). GRU combined the forget and input gates into a single “update gate” and merged the cell state and hidden state. Although GRU has fewer parameters and is faster to compute, it is not always clear whether it performs as well as LSTM on complex tasks.

Architecture of LSTM

The architecture of LSTM is composed of a set of recurrently connected blocks, or cells. Each cell has three multiplicative units that interact in a special way to control the flow of information. These units are often referred to as “gates” because they can allow or prevent information from passing through, much like a physical gate.

The three gates in an LSTM cell are: the input gate, which controls the extent to which a new value flows into the cell; the forget gate, which controls the extent to which a value remains in the cell; and the output gate, which controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit.

Input Gate

The input gate determines how much of the incoming information should be stored in the cell state. It uses a sigmoid function to scale the values between 0 and 1. A value of 0 means “let nothing through”, while a value of 1 means “let everything through”.

Alongside the input gate, there’s a tanh layer that creates a vector of new candidate values, C_tilde, that could be added to the state. In the next step, the LSTM decides what parts of this candidate state should be added to the cell state.

Forget Gate

The forget gate decides what information should be discarded from the cell state. It also uses a sigmoid function to output values between 0 and 1. If the forget gate outputs a 0, it means “completely ignore this”, and if it outputs a 1, it means “completely pay attention to this”.

After the forget gate, the LSTM takes the old cell state, C_{t-1}, multiplies it by the output of the forget gate, and adds the product of the input gate's output and the candidate values, C_tilde. This results in the new cell state: C_t = f_t * C_{t-1} + i_t * C_tilde.

Output Gate

The output gate decides what the next hidden state should be. This hidden state will be used in the prediction task, and will be passed to the next LSTM cell. The output gate takes the current input and the previous hidden state, passes them through a sigmoid function, and multiplies the output by the cell state passed through a tanh function.

The result is the new hidden state, h_t. This hidden state can then be used to compute the output of the LSTM, or it can be passed to the next LSTM cell.
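The gate computations described above can be collected into a single forward step. The following is a minimal NumPy sketch of one LSTM cell update; the parameter names (`W_i`, `b_i`, and so on) and the toy dimensions are illustrative choices, not part of any particular library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a single LSTM cell.

    x_t: input vector; h_prev, c_prev: previous hidden and cell state.
    params: weight matrices W_* (acting on [h_prev; x_t]) and biases b_*.
    """
    z = np.concatenate([h_prev, x_t])                     # stacked [h_{t-1}; x_t]
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate values
    c_t = f * c_prev + i * c_tilde                        # new cell state
    h_t = o * np.tanh(c_t)                                # new hidden state
    return h_t, c_t

# Toy dimensions: 3-dimensional input, 2-dimensional hidden state.
rng = np.random.default_rng(0)
n_in, n_h = 3, 2
params = {}
for gate in ("i", "f", "o", "c"):
    params[f"W_{gate}"] = rng.standard_normal((n_h, n_h + n_in)) * 0.1
    params[f"b_{gate}"] = np.zeros(n_h)

h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((5, n_in)):   # process a length-5 sequence
    h, c = lstm_step(x, h, c, params)
```

Note that the hidden state h_t is always bounded by the final tanh, while the cell state C_t is not, which is what lets the cell accumulate information over long spans.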

Training LSTM

Training an LSTM network is similar to training a traditional neural network. It involves presenting the network with inputs and desired outputs (known as targets), and adjusting the weights of the network to minimize the difference between the network's outputs and the targets. The gradients needed for these weight updates are computed with the backpropagation algorithm.

However, unlike traditional neural networks, LSTMs have a complex structure with multiple interacting layers, which makes the backpropagation process more complicated. This is where the concept of “backpropagation through time” (BPTT) comes in. BPTT is an extension of the backpropagation algorithm for feedforward neural networks to handle recurrent neural networks.

Backpropagation Through Time (BPTT)

BPTT works by unrolling the recurrent network across the time steps of the input sequence, treating the unrolled network as a deep feedforward network, and applying the standard backpropagation algorithm to it. This propagates the error back through time, from the end of the sequence to its beginning, allowing the LSTM to learn from the temporal dynamics of the sequence.
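To make the unrolling concrete without the full LSTM machinery, here is BPTT for the simplest possible recurrent model, a scalar RNN h_t = tanh(w*h_{t-1} + u*x_t) with squared-error loss. This is a pedagogical sketch, not the LSTM gradient; the function name and setup are invented for illustration. The usage at the end checks the analytic gradient against a numerical one.

```python
import numpy as np

def bptt_scalar_rnn(xs, ys, w, u, h0=0.0):
    """Forward pass and BPTT for h_t = tanh(w*h_{t-1} + u*x_t),
    loss L = 0.5 * sum_t (h_t - y_t)^2. Returns (loss, dL/dw, dL/du)."""
    T = len(xs)
    hs = np.empty(T + 1)
    hs[0] = h0
    for t in range(T):                     # unroll forward over time
        hs[t + 1] = np.tanh(w * hs[t] + u * xs[t])
    loss = 0.5 * np.sum((hs[1:] - ys) ** 2)

    dw = du = 0.0
    dh_next = 0.0                          # gradient arriving from step t+1
    for t in reversed(range(T)):           # propagate the error back in time
        dh = (hs[t + 1] - ys[t]) + dh_next
        da = dh * (1.0 - hs[t + 1] ** 2)   # backprop through tanh
        dw += da * hs[t]
        du += da * xs[t]
        dh_next = da * w                   # flows into h_{t-1}
    return loss, dw, du

xs = np.array([0.5, -1.0, 0.3])
ys = np.array([0.1, 0.0, -0.2])
loss, dw, du = bptt_scalar_rnn(xs, ys, w=0.8, u=0.5)

# Numerical check: central difference on w.
eps = 1e-6
num_dw = (bptt_scalar_rnn(xs, ys, 0.8 + eps, 0.5)[0]
          - bptt_scalar_rnn(xs, ys, 0.8 - eps, 0.5)[0]) / (2 * eps)
```

The repeated multiplication by w in the backward loop is exactly where the vanishing and exploding gradient problems discussed below come from.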

However, BPTT has its own challenges. One of the main problems is the so-called “vanishing gradient” problem, where the gradients of the loss function become too small as they are propagated back in time. This makes the weights of the network hard to update and can slow down the learning process. LSTM networks mitigate this problem with their gating mechanism, which allows them to selectively forget or remember information, thus controlling the flow of gradients.

Gradient Clipping

Another technique often used in training LSTM networks is gradient clipping. This is a technique used to prevent the “exploding gradients” problem, where the gradients of the loss function become too large and cause the learning algorithm to fail.

Gradient clipping works by setting a threshold value; if the gradient (typically measured by its norm) exceeds this threshold, it is scaled down so that it equals the threshold. This effectively limits the maximum magnitude of the gradient and helps to stabilize the learning process.
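A minimal sketch of the common "clip by global norm" variant, assuming the gradients arrive as a list of NumPy arrays (the function name is illustrative, not a library API):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their combined L2 norm
    does not exceed max_norm. Returns (clipped_grads, original_norm)."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads], total_norm
    return grads, total_norm

# Gradients with global norm sqrt(9 + 16 + 144) = 13, clipped down to 5.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
```

Rescaling by the global norm (rather than clipping each element independently) preserves the direction of the overall gradient, only shrinking its length.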

Applications of LSTM


LSTM networks have been successfully applied to a wide range of tasks that involve sequential data. They have achieved state-of-the-art results on tasks such as speech recognition, language modeling, translation, image captioning and more.

One of the key advantages of LSTM networks is their ability to learn from long sequences of data, making them particularly effective for tasks that involve temporal dependencies. This makes them a popular choice for many applications in the field of artificial intelligence.

Speech Recognition

In speech recognition, LSTM networks are used to model the temporal dynamics of speech. They can learn to recognize patterns in the audio signal that correspond to spoken words or phrases, and can even learn to recognize the speaker’s identity or emotional state.

One of the challenges in speech recognition is dealing with variable-length input sequences. LSTM networks handle this by processing the input sequence one element at a time, and updating their internal state accordingly. This allows them to handle input sequences of any length.

Natural Language Processing

In natural language processing, LSTM networks are used for tasks such as language modeling, machine translation, and sentiment analysis. In language modeling, an LSTM network is trained to predict the next word in a sentence based on the previous words. This allows it to learn the syntax and semantics of the language, and can be used to generate new sentences in the language.

In machine translation, an LSTM network is used to translate a sentence from one language to another. The network is trained on pairs of sentences in the source and target languages, and learns to map the source sentence to the target sentence. This involves learning the syntax and semantics of both languages, as well as the mapping between them.

Time Series Prediction

LSTM networks are also used for time series prediction, where the goal is to predict future values of a time series based on past values. This is a common task in many fields, including finance, economics, and meteorology.

One of the challenges in time series prediction is dealing with temporal dependencies. LSTM networks handle this by maintaining an internal state that can remember past values, allowing them to learn and model temporal dependencies in the data.
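Before an LSTM can be trained on a time series, the series is usually cut into fixed-length input windows, each paired with the value that follows it as the prediction target. A small NumPy sketch of that preprocessing step (the function name and toy series are illustrative):

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (inputs, targets) pairs: each input is a
    length-`window` slice and the target is the value that follows it."""
    n = len(series) - window
    X = np.array([series[i:i + window] for i in range(n)])
    y = np.array([series[i + window] for i in range(n)])
    return X, y

series = np.arange(10, dtype=float)   # stand-in for a real time series
X, y = make_windows(series, window=3)
# X[0] is [0, 1, 2] with target y[0] = 3; X has shape (7, 3).
```

Each row of X would then be fed to the LSTM one element at a time, with y as the training target for the final hidden state.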


Conclusion

Long Short-Term Memory (LSTM) is a powerful and flexible architecture for recurrent neural networks, capable of learning long-term dependencies in data. Since its introduction in 1997, it has been continually improved and adapted, and has been applied to a wide range of tasks in artificial intelligence.

Despite the complexity of the LSTM architecture and the challenges involved in training LSTM networks, they have proven to be a valuable tool in the field of deep learning. With their ability to process and learn from sequential data, LSTM networks have opened up new possibilities for the development of intelligent systems.
