Flash retention – scaling attention and retention mechanisms
The original attention mechanism and its scaling cost Self‑attention, introduced around 2017, allows each token in a sequence to interact with every other token. The mechanism builds a dense attention matrix whose size grows quadratically with the sequence length. This design is elegant but expensive: the time and memory