LSTMs from scratch

2026-06-26

A patient, ground-up build of the LSTM — from a single neuron to a recurrent circuit with memory. What the four gates do, why we add the cell state, and what the network actually learns to produce.

There is a diagram of an LSTM that almost everyone has seen — a box with three or four sigmoid blobs inside, a tangle of arrows looping back on themselves, two lines running across the top and bottom, little × and + symbols scattered around. The first time you see it, it looks like the schematic of something you are not qualified to touch. It looks like wiring.

By the end of this article that diagram will look like what it actually is: four small neural layers and about six arithmetic operations. Nothing more. No prerequisites beyond knowing what a vector and a matrix are. We are going to start one neuron at a time and build up until the tangle untangles itself. If you want the companion piece on the architecture that eventually replaced LSTMs for most tasks, that is bits2bricks/Transformers from scratch — but the LSTM is where the key idea of a learned memory was first made to work, and it is worth understanding on its own.

1. Learning from data, in one neuron

Forget networks for a moment. The whole field rests on one object, and it is almost embarrassingly simple.

You have some numbers describing a thing — the pixels of an image, the readings of a sensor, the word counts of an email. Call that list of numbers a vector, $x = (x_1, x_2, \dots, x_n)$. You want to map it to a decision: spam or not, cat or dog, buy or sell.

The oldest trick in the book is the perceptron [1]. You give each input its own weight $w_i$, multiply each input by its weight, add them all up, and add one more number called the bias $b$:

[1] Frank Rosenblatt, 1958. He built it as physical hardware — the Mark I Perceptron — a room-sized machine with 400 photocells wired to motor-driven potentiometers that adjusted the weights. The New York Times reported it would soon "walk, talk, see, write, reproduce itself and be conscious of its existence." It could not do any of that. But the core math survived intact into every network running today.

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w} \cdot \mathbf{x} + b$$

That $\mathbf{w} \cdot \mathbf{x}$ is a dot product — the same operation that turns out to be at the heart of attention, of similarity, of almost everything. It measures how much the input lines up with the weight vector. If the sum $z$ clears some threshold, the neuron says "yes" (1); otherwise it says "no" (0).

Here is the geometric picture, which is the one to actually keep. The equation $\mathbf{w} \cdot \mathbf{x} + b = 0$ defines a flat boundary — a line in 2D, a plane in 3D, a hyperplane in higher dimensions. Everything on one side of it is a "yes," everything on the other side is a "no." The weights tilt and orient that boundary. The bias slides it back and forth.
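As a minimal sketch of that decision rule, here is one neuron in NumPy. The function name `perceptron_predict`, the weights, and the choice of 0 as the threshold (with the bias absorbing any other threshold) are illustrative assumptions, not something the article fixes.

```python
import numpy as np

def perceptron_predict(w, b, x):
    """Return 1 if w . x + b lands on the positive side of the boundary, else 0."""
    z = np.dot(w, x) + b
    return int(z > 0)

# Hypothetical weights: the boundary is the line x1 + x2 - 1.5 = 0.
w = np.array([1.0, 1.0])
b = -1.5

on_yes_side = perceptron_predict(w, b, np.array([1.0, 1.0]))
on_no_side = perceptron_predict(w, b, np.array([0.0, 1.0]))
print(on_yes_side, on_no_side)
```

Changing `w` rotates the line; changing `b` slides it without rotating it.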

So what does it mean for this thing to learn? Nothing mystical. You show it an example, it makes a guess, you check the guess against the true answer, and if it was wrong you nudge every weight a tiny bit in the direction that would have made it less wrong. Do that across thousands of examples and the boundary drifts until it lands somewhere that separates the classes. The line is not programmed. It is found, by repeated correction.
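The guess, check, nudge loop can be sketched with the classic perceptron update rule. The dataset (logical AND), the step size `lr`, and the number of passes are assumptions chosen to keep the example tiny; the article does not commit to a specific update rule at this point.

```python
import numpy as np

# Toy dataset: logical AND, which a single straight line can separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 0.25  # size of each nudge (an assumed value)

for epoch in range(50):
    for xi, target in zip(X, y):
        guess = int(np.dot(w, xi) + b > 0)
        error = target - guess   # 0 if right, +1 or -1 if wrong
        w += lr * error * xi     # move the boundary toward the correct side
        b += lr * error

predictions = [int(np.dot(w, xi) + b > 0) for xi in X]
print(predictions)
```

Nobody wrote the final values of `w` and `b` by hand; they are whatever the corrections left behind.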

That is the entire seed of machine learning. A parametrized function, a measure of how wrong it is, and a rule for nudging the parameters downhill. Everything else — every architecture in this article and the next — is that same loop, scaled up and wired into more interesting shapes.

But one neuron has a hard ceiling. A single hyperplane can only separate things that are linearly separable. The classic counterexample is XOR [2]: four points where no straight line can carve the yeses from the noes. To draw a curved boundary, you need more than one line. You need a layer.

[2] Minsky and Papert's 1969 book Perceptrons proved a single-layer perceptron cannot learn XOR — a function where the answer is "yes" only when the two inputs disagree. No straight line separates those four points. The result was read as a death sentence for the whole approach and is often blamed for the first "AI winter." The fix was already conceptually obvious — stack more neurons — but training stacked neurons efficiently took another decade and a half.
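To make "you need more than one line" concrete, here is a sketch of XOR solved by two threshold neurons feeding a third. All weights below are hand-picked for illustration (they are not learned, and they are not the only choice that works).

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: two neurons, each drawing its own straight line.
W1 = np.array([[1.0, 1.0],    # fires when at least one input is on (OR)
               [1.0, 1.0]])   # fires only when both inputs are on (AND)
b1 = np.array([-0.5, -1.5])

# Output neuron: "OR, but not AND".
w2 = np.array([1.0, -1.0])
b2 = -0.5

hidden = step(X @ W1.T + b1)
xor = step(hidden @ w2 + b2)
print(xor)
```

Each hidden neuron is still a single hyperplane; only the combination carves out the XOR region.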

2. A layer is just many neurons at once

Take the single neuron and make copies of it — say, 64 of them — all looking at the same input $x$, but each with its own weights and its own bias. Each one draws its own hyperplane. Each one outputs a single number. Together they turn the input vector into a new vector of 64 numbers.

That bundle is a layer, and if you write all 64 weight vectors as the rows of a single matrix $W$, the whole layer collapses into one clean expression:

$$\mathbf{h} = \sigma\!\left( W \mathbf{x} + \mathbf{b} \right)$$

A matrix multiply, a bias added, and then $\sigma$ — a nonlinearity applied to each number. This is the part that matters and the part beginners skip. Without the nonlinearity, stacking layers buys you nothing: two matrix multiplies in a row are just one bigger matrix multiply, so a hundred linear layers collapse back into a single line. The nonlinearity — historically a sigmoid that squashes everything into the range 0 to 1, today usually a ReLU that just clips negatives to zero — is what lets each layer bend the space instead of only tilting it. Bend the space enough times and you can separate anything.
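The collapse of stacked linear layers is easy to check numerically. This is a small sketch with random matrices; the sizes and the seed are arbitrary assumptions, and biases are left out to keep the algebra visible.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 10))   # first layer: 10 inputs -> 64 neurons
W2 = rng.normal(size=(3, 64))    # second layer: 64 -> 3
x = rng.normal(size=10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two linear layers in a row...
two_linear_layers = W2 @ (W1 @ x)
# ...are the same as one layer whose matrix is the product W2 W1.
one_merged_layer = (W2 @ W1) @ x
collapses = np.allclose(two_linear_layers, one_merged_layer)

# Put a squash between them and the merged matrix no longer reproduces the output.
with_nonlinearity = W2 @ sigmoid(W1 @ x)
still_collapses = np.allclose(with_nonlinearity, one_merged_layer)

print(collapses, still_collapses)
```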

A layer — A linear transformation $W\mathbf{x} + \mathbf{b}$ followed by a nonlinearity $\sigma$. Vector in, vector out.

A deep network — Several layers stacked, each one feeding the next. The output of layer 1 is the input to layer 2.

A dense (or fully-connected) layer — Every input connects to every neuron. $W$ has no zeros forced into it; everything talks to everything.

Stack a few of these and you have a multi-layer perceptron, the plain-vanilla neural network. Feed in a vector at the bottom, let it flow up through the layers, read a prediction off the top. The magic theorem here is that a network like this is a universal approximator [3] — with enough neurons it can fit essentially any function from input to output.

[3] The universal approximation theorem (Cybenko 1989, Hornik 1991) says a network with even a single hidden layer can approximate any continuous function on a bounded input region to arbitrary precision, given enough neurons. It is a reassuring existence proof and a practically useless recipe: it tells you a good setting of the weights exists, not how to find it, nor how many neurons "enough" is. Depth, in practice, is what makes the search tractable.

How the weights actually get set is its own deep topic — the algorithm is backpropagation paired with gradient descent, and it is the same engine that trains every model in this article. For our purposes the one-sentence version is enough: run an example through, measure how wrong the output was, and push every weight a hair in the direction that reduces the error. Repeat a few billion times.

But here is the reframe the rest of the article quietly depends on, and the thing most explanations skip right past. When you read "an LSTM" or "a transformer," it is tempting to picture a bigger version of this stack — one tall tower of layers, input at the bottom, answer at the top. That picture is wrong, and it is exactly why these models look so mysterious from the outside.

What you actually have is a small handful of those $\sigma(W\mathbf{x}+\mathbf{b})$ blocks — sometimes just three or four — wired together in a fixed, deliberate pattern. Some blocks' outputs get multiplied together, some get added, some get fed back in as inputs on the next step. The blocks are interchangeable and dumb; the wiring is the architecture. The idea lives in the topology, not in the layers.

It helps to think of it as a differentiable circuit. In ordinary electronics you wire up logic gates — AND, OR, NOT — and the pattern of connections computes something; here the "gates" are neural layers and the wires carry vectors of real numbers, but the spirit is identical — a fixed graph of operations, designed by hand, with learnable parameters tucked inside the blocks. The designer chooses the wiring; training chooses the weights.

It is why two models built from the very same primitive can behave like completely different machines. A plain MLP wires its layers in a straight line: data flows up, once, and leaves. An LSTM takes those same layers and wires them into a loop, with a line of memory threaded through it. Same Lego bricks. Wildly different toy.

A plain feedforward network wires its layers in a straight line: data flows up, once, and leaves. Nothing connects the output back to the input, so it has no memory of what it saw a step ago.

The thing to hold onto: A neural layer is not exotic. It is a matrix multiply plus a squash. That is the only computational primitive in this entire article. Attention layers, LSTM gates, the readout head at the very end — all of them are this same operation with different shapes and different wiring. When the diagrams get scary later, remember that every box inside them is just $\sigma(W\mathbf{x} + \mathbf{b})$.

· · ·

That is the foundation: a neuron is a hyperplane, a layer is a matrix-multiply-plus-squash, a deep network stacks them, and an architecture is a circuit you build out of those layers by hand. Next we wire our first loop — and run straight into the problem that the LSTM was invented to fix.

3. Wiring the first loop

The plain network we just built has a blind spot that disqualifies it for an entire universe of problems: it has no memory, and no sense of order.

Feed it a movie review and it sees a bag of words with no before-and-after. Feed it today's stock price and it has already forgotten yesterday's. Each input is processed alone, from scratch, as if nothing came before it. For a photo, fine — a cat is a cat regardless of what you showed the network last. But language, audio, sensor streams, prices, DNA — these are sequences, where the meaning of the thing you are looking at right now depends on the things that came before it. The word "bank" means something different after "river" than after "savings." A network with no memory cannot use that.

So we change the wiring. We take a dense layer and feed its own output back into itself on the next step. At each timestep $t$ the cell looks at two things: the new input $x_t$, and its own state from the previous step, $h_{t-1}$. It mixes them and produces a new state:

$$h_t = \tanh\!\left( W_x\, x_t + W_h\, h_{t-1} + b \right)$$

That is the entire recurrent neural network. One dense layer with a loop. The vector $h_t$ is the hidden state — a running summary of everything the network has seen so far, squeezed into a fixed-size vector and carried forward. Read a word, update the summary. Read the next word, update it again. The summary is the memory.
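In code, the recurrence is one line inside a loop. This is a sketch with untrained random weights; the sizes, the seed, the zero initial state, and the name `rnn_step` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8   # illustrative sizes

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One timestep: mix the new input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence = rng.normal(size=(6, input_size))   # six timesteps of made-up input
h = np.zeros(hidden_size)                     # start with an empty summary
for x_t in sequence:
    h = rnn_step(x_t, h)                      # read an input, update the summary

print(h.shape)
```

After the loop, `h` is the fixed-size summary of all six inputs.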

If you "unroll" the loop in time — draw the cell once per timestep, in a row — it looks like a deep feedforward network, one layer per word in the sentence:

Unrolled in time, the loop looks like a deep feedforward network — one copy of the same cell per timestep, each handing its hidden state to the next.

And here is the first surprising payoff: there is almost nothing to train. Look at the unrolled picture. It might be 500 cells deep for a 500-word document, but every one of those cells is the same cell. The weights $W_x$, $W_h$, and $b$ are shared across all timesteps. A 500-step sequence does not need 500 sets of parameters; it needs one, reused 500 times.

This is weight sharing, and it is not a cost-saving hack — it is the whole point. It bakes in an assumption about the world: the rule for combining "what's happening now" with "what I remember" is the same at every moment in time. The grammar that connects word 2 to word 1 is the same grammar that connects word 400 to word 399. By forcing one set of weights to handle every position, the network is made to learn a general update rule instead of memorizing position-by-position quirks. It is also what lets a single trained model swallow a sequence of any length — 5 steps or 5,000 — without changing shape. A handful of weights, applied over and over, is the entire machine.
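A quick way to see weight sharing is to count parameters once and then run sequences of very different lengths through the same cell. The sizes, the seed, and the helper name `run_rnn` are assumptions; the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 8

# One set of weights, created once.
params = {
    "W_x": rng.normal(scale=0.1, size=(hidden_size, input_size)),
    "W_h": rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
    "b": np.zeros(hidden_size),
}
n_params = sum(p.size for p in params.values())

def run_rnn(sequence):
    """Apply the same cell at every timestep and return the final hidden state."""
    h = np.zeros(hidden_size)
    for x_t in sequence:
        h = np.tanh(params["W_x"] @ x_t + params["W_h"] @ h + params["b"])
    return h

h_short = run_rnn(rng.normal(size=(5, input_size)))      # 5 steps
h_long = run_rnn(rng.normal(size=(5000, input_size)))    # 5,000 steps

print(n_params, h_short.shape, h_long.shape)
```

The parameter count depends only on the input and hidden sizes, never on the sequence length.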

A six-word sentence run through one RNN cell is already a six-layer-deep computation, even though you only ever wrote down one layer's worth of weights. Depth in a recurrent network comes from time, not from stacking — and that is the source of both its power and, as we are about to see, its central disease.

4. Why the simple loop forgets

The RNN looks like it should work. It has memory, it shares weights, it handles any length. For short sequences it does work. And then you give it a long one, and it falls apart in a very specific, very instructive way.

Go back to the unrolled picture and ask how this thing learns. To train it, the error measured at the end of the sentence has to travel backward through every cell — through every timestep — to tell the early weights how they should have behaved. That backward signal is the gradient, and the chain rule says that to send it from step 100 back to step 1, you multiply it by the cell's local derivative once per step. A hundred steps back means roughly a hundred multiplications by the same recurrent matrix $W_h$ and the same squashing derivative.
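Written out for the tanh recurrence from section 3, the chain rule gives the product below. This is a sketch of the standard derivation, not something specific to this article; $a_k$ is shorthand introduced here for the pre-activation at step $k$, and $T$ for the last step.

```latex
a_k = W_x\, x_k + W_h\, h_{k-1} + b, \qquad h_k = \tanh(a_k)

\frac{\partial h_k}{\partial h_{k-1}}
  = \operatorname{diag}\!\left(1 - h_k^{2}\right) W_h

\frac{\partial h_T}{\partial h_t}
  = \frac{\partial h_T}{\partial h_{T-1}}
    \,\frac{\partial h_{T-1}}{\partial h_{T-2}}
    \cdots
    \frac{\partial h_{t+1}}{\partial h_t}
```

Every factor contains the same $W_h$ and a tanh derivative that is at most 1, so the product has $T - t$ copies of essentially the same multiplier.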

Now think about what happens when you multiply a number by itself a hundred times.

If the factor is a little less than 1 — say 0.9 — then $0.9^{100} \approx 0.00003$. The signal vanishes. By the time the error from step 100 reaches step 1, it is so close to zero that the early weights feel nothing. The network literally cannot tell that word 1 mattered.

If the factor is a little more than 1 — say 1.1 — then $1.1^{100} \approx 13{,}780$. The signal explodes. The gradient blows up, the weights lurch, training diverges into NaNs.

This is the vanishing gradient problem (and its evil twin, the exploding gradient), and it is not a bug you can patch. It is a structural consequence of pushing a signal through the same multiply-and-squash a hundred times in a row. Exploding gradients you can crudely tape over by clipping them when they get too big. Vanishing gradients are worse, because there is nothing to clip — the information is simply gone.
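Both numbers, and the "crude tape" of clipping, fit in a few lines. The helper `clip_by_norm` and the `max_norm` value are illustrative assumptions; real frameworks ship their own clipping utilities.

```python
import numpy as np

# The same factor applied 100 times.
vanish = 0.9 ** 100
explode = 1.1 ** 100
print(vanish, explode)

def clip_by_norm(grad, max_norm):
    """Rescale the gradient so its length never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploded_grad = np.array([3000.0, -4000.0])        # length 5000
clipped = clip_by_norm(exploded_grad, max_norm=5.0)
print(clipped, np.linalg.norm(clipped))
```

Clipping keeps the direction and shortens the step. There is no matching trick for the vanished case: rescaling a gradient that is already about 0.00003 of its original size does not recover what it used to say.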

The practical effect: a vanilla RNN has a memory of maybe 5 to 10 steps. Past that, the past is fog. Consider the sentence:

the cats that the dog chased all day were exhausted

To choose "were" over "was," the network has to remember, across the entire intervening clause, that the subject was "cats" (plural), not "dog" (singular). That dependency spans eight words. A simple RNN, with its gradient already decayed to nothing by then, will happily write "was" — it has forgotten what the sentence was even about. Anything that requires connecting something early to something late — the subject of a long sentence, the opening premise of an argument, a plot point from chapter one — is beyond it.

So we are stuck between two failures. Multiply by something under 1 and the memory evaporates. Multiply by something over 1 and the whole thing detonates. Is there a way to carry information across many steps without multiplying it by anything at all?

That question, asked in 1997 [4], is the entire reason the LSTM exists. The answer is the diagram from the introduction — and it is, at heart, one beautifully simple idea: build a path through time where the signal is added to, not multiplied through. Let's wire it.

[4] Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation 9(8). The forget gate — arguably the most important part of the modern cell — was not in the original 1997 design. It was added three years later by Gers, Schmidhuber and Cummins (2000), because the original cells would let their memory grow without bound and never reset. The version everyone uses today is really the 2000 version.

5. The LSTM cell

The fix starts by splitting memory into two separate lines instead of one.

A vanilla RNN has a single state vector $h$ that has to do everything at once: be the long-term memory and be the output the rest of the network reads. The LSTM separates those jobs. It keeps the hidden state $h_t$ as the working output, and adds a second, private line: the cell state $c_t$. The cell state is the long-term memory — a vector that runs straight across the top of the diagram, from $c_{t-1}$ on the left to $c_t$ on the right, barely touched. Think of it as a conveyor belt carrying the network's notes forward through time. Nothing multiplies it by a weight matrix. Things are only gently erased from it and added to it, by small valves called gates.

Here is the whole cell. Everything below is just naming the pieces.

The whole machine. Four little layers reading the same input z, and a memory line (top) that is only ever multiplied by a forget valve and added to. That additive top path is the entire trick.

One input for everything: z

Look at the bottom of the diagram. The previous hidden state $h_{t-1}$ and the current input $x_t$ get stuck together, end to end, into a single longer vector. Call it $z$:

$$z_t = [\,h_{t-1},\; x_t\,]$$

That is it — concatenation, no math. This combined vector is the single point of entry for the entire cell. Every gate, every decision the cell makes, is computed from this same $z$. It is the cell's complete view of the world at time $t$: "here is what I remember ($h_{t-1}$), and here is what just arrived ($x_t$)." Now four little networks read that view, each asking a different question about it.
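For completeness, the concatenation in NumPy. The numbers are made up; the only thing the example is meant to show is the resulting length.

```python
import numpy as np

h_prev = np.array([0.1, -0.2, 0.3])   # previous hidden state (hidden size 3, assumed)
x_t = np.array([1.0, 0.5])            # current input (input size 2, assumed)

z_t = np.concatenate([h_prev, x_t])   # end to end: memory first, then the new input
print(z_t.shape)
```

The length of `z_t` is the hidden size plus the input size, which fixes the number of columns in every gate's weight matrix.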

Four small layers reading the same view

This is the payoff of everything in the foundation. The four yellow boxes in the diagram are not exotic. Each one is exactly the $\sigma(Wz + b)$ layer from the very first sections — a dense layer, a matrix multiply and a squash. Same primitive, four copies, each with its own learned weights, all reading the same $z$:

$$
\begin{aligned}
f_t &= \sigma(W_f\, z_t + b_f) &&\text{forget gate} \\
i_t &= \sigma(W_i\, z_t + b_i) &&\text{input gate} \\
\tilde{C}_t &= \tanh(W_c\, z_t + b_c) &&\text{candidate} \\
o_t &= \sigma(W_o\, z_t + b_o) &&\text{output gate}
\end{aligned}
$$

Three of them use a sigmoid $\sigma$, which squashes every number into the range 0 to 1. That range is deliberate: a sigmoid output is a valve setting. Zero means "let nothing through," one means "let everything through," and the values in between are partial openings — one valve per dimension of the memory. The fourth box uses tanh, which squashes into −1 to 1, because it is not a valve; it is proposed content, and content needs to be able to point in either direction.

So the cell's entire learnable substance is four weight matrices and four bias vectors. That is the "not much to train" from before, made concrete. With hidden size $H$ and input size $D$, each matrix is $H \times (H + D)$ and each bias has $H$ entries, so the total is $4H(H + D) + 4H$ numbers: a few hundred thousand when $H$ is around 256, reused at every single timestep of a sequence of any length. The sentence can be 5 words or 5,000; it is always these same four layers, firing again and again. Now watch what they do to the conveyor belt.
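Here is a sketch of the four layers and their parameter count. The dictionary keys `"f"`, `"i"`, `"c"`, `"o"`, the sizes, the random initialization, and the helper `lstm_param_count` are all assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_param_count(hidden_size, input_size):
    """Four matrices of shape (H, H + D) plus four bias vectors of length H."""
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)

rng = np.random.default_rng(0)
hidden_size, input_size = 8, 4
z_size = hidden_size + input_size

# One weight matrix and one bias per layer: forget, input, candidate, output.
W = {k: rng.normal(scale=0.1, size=(hidden_size, z_size)) for k in "fico"}
b = {k: np.zeros(hidden_size) for k in "fico"}

z_t = rng.normal(size=z_size)                  # stand-in for [h_prev, x_t]
f_t = sigmoid(W["f"] @ z_t + b["f"])           # valve, in (0, 1)
i_t = sigmoid(W["i"] @ z_t + b["i"])           # valve, in (0, 1)
c_tilde = np.tanh(W["c"] @ z_t + b["c"])       # content, in (-1, 1)
o_t = sigmoid(W["o"] @ z_t + b["o"])           # valve, in (0, 1)

n_params = sum(m.size for m in W.values()) + sum(v.size for v in b.values())
print(n_params, lstm_param_count(256, 128))
```

The second printed number is the count for a hypothetical cell with 256 hidden units and 128 inputs.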

Step 1 — the forget gate decides what to erase

The cell state $c_{t-1}$ arrives from the left carrying everything the network has been remembering. The first thing that happens to it is a multiplication by the forget gate:

$$f_t \odot c_{t-1}$$

The $\odot$ is an elementwise multiply — slot 1 of $f_t$ multiplies slot 1 of $c_{t-1}$, slot 2 times slot 2, and so on. Because every entry of $f_t$ sits between 0 and 1, this operation can only shrink the memory, never grow it. Where $f_t$ is near 1, that slot of memory passes through untouched. Where $f_t$ is near 0, that slot gets wiped to zero — forgotten.

This is the network deciding, per dimension, what is no longer worth carrying. When a sentence hits a full stop and a new subject begins, the forget gate is what lets the cell flush the old subject and make room. It learned, from data, when to let go.
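The "can only shrink" property is easy to check slot by slot. The memory values and gate settings below are made up for illustration.

```python
import numpy as np

c_prev = np.array([2.0, -1.5, 0.8, 4.0])   # old memory, one number per slot
f_t = np.array([1.0, 0.0, 0.5, 0.9])       # keep, erase, halve, keep most

after_forget = f_t * c_prev                # elementwise: slot k times slot k
print(after_forget)

# No slot ever gets larger in magnitude than it was before.
never_grows = bool(np.all(np.abs(after_forget) <= np.abs(c_prev)))
print(never_grows)
```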

Step 2 — the input gate decides what to write

Now we add new information. Two of the little layers cooperate here. The candidate $\tilde{C}_t$ (the tanh box) proposes what could be written into memory — a vector of new content. The input gate $i_t$ (a sigmoid) decides how much of that proposal actually gets written, slot by slot:

$$i_t \odot \tilde{C}_t$$

Same elementwise pattern. The candidate says "here is something I could store about what I just saw"; the input gate says "store this part of it, ignore that part." A proposal, and a valve controlling how much of the proposal lands. Splitting it this way means the cell can compute a rich candidate and still choose to write almost none of it — or all of it — depending on context.

Step 3 — update the belt: erase, then add

Here is the single most important line in the whole architecture. The new cell state is the old memory after forgetting, plus the new content after gating:

$$c_t = \underbrace{f_t \odot c_{t-1}}_{\text{keep}} \;+\; \underbrace{i_t \odot \tilde{C}_t}_{\text{write}}$$

Read it in plain language: take the old memory, erase the parts you've decided to forget, and add the parts you've decided to write. That is the entire memory update. An erase and a write, on a vector that otherwise just rides the belt forward.
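A three-slot worked example of erase-then-add, with hand-picked values so each slot shows a different case. All numbers are illustrative assumptions.

```python
import numpy as np

c_prev = np.array([2.0, -1.5, 0.8])    # old memory
f_t = np.array([1.0, 0.0, 0.5])        # keep all, erase all, keep half
i_t = np.array([0.0, 1.0, 0.5])        # write nothing, write all, write half
c_tilde = np.array([0.9, -0.6, 0.4])   # proposed new content, within [-1, 1]

keep = f_t * c_prev                    # what survives the forget gate
write = i_t * c_tilde                  # how much of the proposal lands
c_t = keep + write                     # the new cell state

print(keep, write, c_t)
```

Slot 1 carries its old value through untouched, slot 2 is replaced outright by the candidate, and slot 3 blends half of the old value with half of the new.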

And now the vanishing-gradient wall from the last section quietly falls. Notice what is not in this equation: there is no $W$ multiplying $c_{t-1}$. The cell state is not pushed through a weight matrix and a squashing function on its way from one step to the next — it is only scaled, slot by slot, by the forget gate (a number between 0 and 1, close to 1 wherever the cell has learned to keep something) and added to. When the gradient flows backward along this belt during training, it travels through additions and gentle multiplications instead of a hundred compounding matrix multiplies. As long as the forget gate stays open, the signal does not get exponentially crushed. An error at step 100 can reach step 1 with its message still legible. The belt is a gradient highway, and building that highway is the only reason the LSTM can remember things a vanilla RNN cannot.

Step 4 — the output gate decides what to reveal

The cell state $c_t$ is the full, private, long-term memory. We do not want to dump all of it out as this step's answer — most of it is bookkeeping the network is holding for later. So the last gate filters it into the hidden state:

$$h_t = o_t \odot \tanh(c_t)$$

The cell state is first squashed by a tanh back into −1 to 1 (it can grow large after many additions; this reins it in). Then the output gate $o_t$ — the fourth sigmoid valve — selects which parts of that squashed memory to expose. The result is $h_t$, the hidden state: the network's working readout for this timestep, and the thing that loops back around to become part of $z$ on the next step.

So the two lines have two clearly different jobs:

The cell state $c_t$ — The long-term ledger. Private, rides the belt, updated only by erase-and-add. This is where memory actually lives.

The hidden state $h_t$ — The working output. A filtered, on-demand view of the ledger — what the rest of the network gets to see, and what feeds back into the loop.

That is the entire cell. Four small layers reading one concatenated input $z$; a forget valve and a write valve editing a conveyor-belt memory by erasing and adding; an output valve deciding how much of that memory to reveal. The intimidating tangle of wires was always just this.
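Put together, the whole cell is about ten lines. This is a forward-pass sketch only, with untrained random weights; the function name `lstm_step`, the dictionary layout, the sizes, and the zero initial states are assumptions, and real libraries typically fuse the four matrices into one for speed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep following the equations in this section."""
    z_t = np.concatenate([h_prev, x_t])         # single point of entry
    f_t = sigmoid(W["f"] @ z_t + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z_t + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ z_t + b["c"])    # candidate content
    o_t = sigmoid(W["o"] @ z_t + b["o"])        # output gate
    c_t = f_t * c_prev + i_t * c_tilde          # erase, then add
    h_t = o_t * np.tanh(c_t)                    # reveal a filtered view
    return h_t, c_t

rng = np.random.default_rng(0)
hidden_size, input_size = 8, 4
W = {k: rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
     for k in "fico"}
b = {k: np.zeros(hidden_size) for k in "fico"}

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(10, input_size)):   # ten made-up timesteps
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

The same `W` and `b` are used at every step; only `h` and `c` change as the sequence goes by.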

6. Why the belt actually remembers

Now we can say precisely what the LSTM bought us.

In the vanilla RNN, the memory had to pass through the same hostile tunnel at every step: a recurrent matrix multiply, a squash, another recurrent matrix multiply, another squash. The past was not carried forward; it was repeatedly transformed. And each transformation gave the gradient one more chance to shrink into nothing.

The LSTM changes the geometry of that path. The cell state does not get rewritten from scratch at every step. It moves forward by a simple recurrence:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{C}_t$$

That equation is the whole invention. The old memory is not pushed through a dense layer. It is multiplied, slot by slot, by the forget gate, and then new content is added. So the backward path through the cell state is correspondingly simple. Along the direct belt path (treating the gate values as given), each slot's derivative is just its forget-gate value:

$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)$$

This is the honest version of "long-term memory." The LSTM does not magically remember everything. It remembers what the forget gate has learned to keep near 1. If a slot of $f_t$ stays open across many steps, information and gradient can travel across those steps with far less damage. If it closes toward 0, that memory is deliberately erased. Long memory is not a property the cell always has; it is a behavior the gates learn.
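Chaining that one-step derivative over many steps gives, per slot, a product of forget-gate values. The sketch below compares two slots over 100 steps; the constant gate values 0.99 and 0.5 are made-up stand-ins, since a trained cell would produce a different value at every step.

```python
import numpy as np

steps = 100
open_slot = np.full(steps, 0.99)     # a slot the cell has learned to keep
closing_slot = np.full(steps, 0.5)   # a slot the cell keeps half-erasing

# How much of a step-1 signal survives to step 100 along the belt.
survives_open = np.prod(open_slot)
survives_closing = np.prod(closing_slot)

print(survives_open, survives_closing)
```

With the gate near 1 roughly a third of the signal is still there after 100 steps; with the gate at 0.5 it is gone, which is the intended behavior for memory the cell chose to drop.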

That is why the 0-to-1 nature of the sigmoid gates matters. The gates are not producing content. They are producing decisions about preservation. Near 0 means "drop this." Near 1 means "carry this forward." The candidate vector proposes new information, the input gate decides whether to write it, the forget gate decides whether the old trace survives, and the output gate decides whether the private memory should be exposed as a hidden state. The whole cell is a small learned filing system: erase, write, reveal.

One more strange point is worth making explicit. The recurrent core still does not predict anything by itself.

The LSTM cell produces $h_t$, a hidden state. That hidden state is a representation — a compact summary of the sequence so far, filtered through the output gate. To turn it into an actual prediction, you place another trained layer on top:

$$\hat{y}_t = g(W_y h_t + b_y)$$

Here $g$ is whatever output function the task calls for: a softmax for classification, the identity for regression. For a sequence-to-one problem, you might only read the final hidden state $h_T$. For a sequence-to-sequence problem, you might read every $h_t$. But in both cases, the prediction comes from a readout head. The loss is measured there, and the gradient flows backward from that readout into the LSTM gates that produced the state. The cell is not the oracle. It is the machine that learns what kind of memory would make the oracle possible.
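A readout head is one more dense layer. The sketch below assumes a three-class sequence-to-one task with a softmax as the output function; the sizes, the random weights, and the stand-in hidden state are all illustrative.

```python
import numpy as np

def softmax(a):
    """Turn a vector of scores into probabilities that sum to 1."""
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_size, n_classes = 8, 3   # assumed sizes

W_y = rng.normal(scale=0.1, size=(n_classes, hidden_size))
b_y = np.zeros(n_classes)

# Stand-in for the final hidden state h_T that an LSTM would hand over.
h_T = np.tanh(rng.normal(size=hidden_size))

y_hat = softmax(W_y @ h_T + b_y)   # one probability per class
print(y_hat, y_hat.sum())
```

For a sequence-to-sequence task the same head would be applied to every hidden state instead of only the last one.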

7. When to reach for one — and when not

This gives us a better way to think about when LSTMs make sense.

An LSTM is useful when the past matters, but you do not already know how to summarize the past. If the important pattern is genuinely sequential — first this happens, then that fades, then a third signal rises, and the order matters — then a learned recurrent memory can be the right tool. This is why LSTMs were natural for language, audio, sensor streams, handwriting, EEG, machine telemetry: domains where the shape of the sequence is part of the signal.

But there is another kind of memory, and it is often stronger in small data: memory you write into the dataset yourself.

Instead of feeding a raw event stream into a recurrent model, you can turn time into covariates: number of events in the last hour, average stress in the last day, hours since the last anomaly, slope over the last week, maximum load in the last six hours, count of visual events in the last 24 hours. Each row then says: at time $t$, given this summarized history, what happens next?

That is explicit memory. Rolling windows, lag features, event counts, time-since-last-event variables. You are doing by hand what the LSTM would otherwise have to discover from examples. And if examples are scarce, that is often the better bargain.
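Here is a minimal sketch of hand-built memory over a single series. The function name `rolling_features`, the chosen statistics, and the crude endpoint-to-endpoint "slope" are assumptions for illustration; a real pipeline would more likely use a dataframe library's rolling-window tools.

```python
import numpy as np

def rolling_features(values, window):
    """For each time t, summarize the `window` readings strictly before t.

    Assumes window >= 2 so the slope is defined.
    """
    rows = []
    for t in range(window, len(values)):
        past = values[t - window:t]   # excludes values[t]: no peeking at the target
        rows.append({
            "t": t,
            "mean": float(np.mean(past)),
            "max": float(np.max(past)),
            "slope": float(past[-1] - past[0]) / (window - 1),
            "lag_1": float(values[t - 1]),
        })
    return rows

readings = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 8.0])   # made-up sensor readings
features = rolling_features(readings, window=3)

for row in features:
    print(row)
```

Each row is a fixed-size summary of the recent past that any ordinary tabular model can consume.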

The danger is that this can also explode. Ten raw variables, eight window sizes, and five statistics per window already give you four hundred features. If the event you care about is rare, four hundred dimensions is not more intelligence; it is more room for hallucinated correlations. The model starts finding little accidents in the data and treating them as laws. That is the curse of dimensionality in its most practical form: too many possible stories, not enough real events to prove which one is true.
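The arithmetic behind that blow-up, spelled out. The variable names, window sizes, and statistics below are hypothetical placeholders; only the counts (ten, eight, five) come from the text.

```python
from itertools import product

variables = [f"var_{i}" for i in range(10)]         # ten raw variables (hypothetical names)
windows_hours = [1, 3, 6, 12, 24, 48, 72, 168]      # eight window sizes (assumed values)
statistics = ["mean", "max", "min", "std", "slope"] # five statistics per window (assumed)

feature_names = [
    f"{var}_{stat}_{hours}h"
    for var, hours, stat in product(variables, windows_hours, statistics)
]

print(len(feature_names), feature_names[:3])
```

Every extra window size or statistic multiplies the total, which is why the list grows faster than it feels like it should.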

So the baseline is not optional. Before an LSTM, build the dumb model. Then build the slightly less dumb model. A rule. A logistic regression over carefully chosen rolling features. A gradient-boosted tree over the same features. Only then ask whether the LSTM is learning something those explicit summaries failed to capture.

The logistic regression is not the grand rival of the LSTM. It is the scientific control. It tells you whether the problem was mostly additive and already expressible in a small set of human-designed temporal variables. The more dangerous rival is gradient boosting, because it can exploit nonlinear interactions between those variables without needing to learn a full recurrent memory from scratch. In many small or medium tabular time problems, that is exactly where the LSTM loses.

This is the rule I would keep:

Chunking time into hand-built window features is explicit memory. An LSTM is learned memory. When data is scarce, noisy, subjective, or full of rare positive events, explicit memory usually wins. Reach for an LSTM only when the useful temporal pattern is real, repeated, order-dependent, and too awkward to summarize cleanly by hand.

That is the actual shape of the tool. An LSTM is not better because it has memory. A spreadsheet with good lag features also has memory. An LSTM is better only when the memory itself is the thing that must be learned.

And now the original diagram should look much less like wiring. Four little neural layers. A belt. Three valves. A readout on top. The scary machine was mostly $\sigma(W\mathbf{x}+\mathbf{b})$, repeated in the right shape, and trained until the right parts of the past learned how to stay alive.