Stacking SWA Notes
What
- Why stacking sliding window attention does not let models “see” very far in practice.
Context
- I found an article I liked and decided to write short notes in simple steps.
- Article: Why Stacking Sliding Windows Can’t See Very Far by Guangxuan Xiao.
Notes
1) The window
- Each layer only looks back W tokens.
- Example with W=10: at token 100, the layer sees tokens 91–100.
- Tiny check: With W=100, token 1000 directly sees tokens 901–1000.
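A tiny sketch of the window arithmetic, assuming the window includes the current token (so token t directly sees tokens t−W+1 through t, clipped at the start of the sequence):

```python
# Sketch only: which tokens position t can attend to directly under sliding
# window attention. Assumption: the W-token window includes the current token.
def visible_range(t: int, W: int) -> range:
    return range(max(1, t - W + 1), t + 1)  # 1-indexed token positions

r = visible_range(100, 10)
print(r.start, r.stop - 1)    # 91 100
r = visible_range(1000, 100)
print(r.start, r.stop - 1)    # 901 1000
```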
2) Layers: reach vs influence
- Stacking L layers suggests info could hop W at a time, so reach ≈ L×W.
- What matters is influence, not just reach. Far info becomes a faint whisper.
- Words to keep: reach = could travel; influence = actually affects the output.
- Tiny check: With L=3 and W=100, nominal reach is 300 tokens; influence that far back is small.
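The reach arithmetic as a sketch, assuming information can move at most one window per layer (one "hop" per layer):

```python
import math

# Assumption: one window-sized hop per layer.
L, W = 3, 100
print(L * W)             # 300: the farthest token that could, in principle, matter

d = 300
print(math.ceil(d / W))  # 3: hops a message from distance d must survive
```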
3) Without residuals: spreading like a blur
- Averaging the last W tokens each layer is repeated blurring.
- Repeated blurs spread slowly, like diffusion.
- Rule: effective spread grows like sqrt(L) windows, not L windows.
- Sanity example: L=100, W=100 → useful spread ≈ sqrt(100) = 10 windows, i.e. about 1000 tokens (numerical check below).
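A quick numerical check of the sqrt(L) rule, modeling each layer as a plain box average of width W (a toy assumption: no residuals, no learned weights):

```python
import numpy as np

def spread_after(L: int, W: int) -> float:
    """Std-dev of a one-token impulse after L box-blur 'layers' of width W."""
    sig = np.array([1.0])
    kernel = np.ones(W) / W        # one layer = uniform average over W positions
    for _ in range(L):
        sig = np.convolve(sig, kernel)
    pos = np.arange(len(sig))
    mean = (pos * sig).sum()       # total mass stays 1, so no normalization needed
    return float(np.sqrt(((pos - mean) ** 2 * sig).sum()))

W = 100
print(spread_after(25, W))    # ≈ 144 tokens
print(spread_after(100, W))   # ≈ 289 tokens: 4x the layers, only ~2x the spread
```

The constant differs from the rough sqrt(L)×W figure (a box blur spreads by about W/sqrt(12) ≈ 0.29 W per layer), but the scaling is the point: quadrupling depth only doubles the spread.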
4) With residuals: the exponential barrier
- Two paths per layer: residual keeps most of the current token (α ~ 0.9–0.99); attention adds a small slice (1−α) from the window.
- To carry info from distance d, it must hop k ≈ ceil(d/W) times through that small slice.
- Each hop multiplies influence by (1−α). After k hops: influence ≈ (1−α)^k.
- Numbers you can feel: α=0.95 → each hop keeps 0.05; after 1, 2, 3 hops: 0.05, 0.0025, 0.000125.
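The same decay as a few lines of arithmetic, assuming a fixed residual share α and one hop per window:

```python
import math

# Toy model assumption: each window-hop multiplies surviving influence by (1 - alpha).
alpha, W = 0.95, 100
for d in (100, 200, 300, 400):
    k = math.ceil(d / W)          # hops needed to cover distance d
    print(d, (1 - alpha) ** k)    # 0.05, 0.0025, 0.000125, 6.25e-06
```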
5) Rule of thumb horizon
- Influence at distance d (tokens) ≈ C · (1−α)^(d/W).
- Practical horizon is where influence falls below your tolerance (e.g., 0.1%).
- Example: α=0.95, W=100 → influence is already very small a few windows out (≈300–400 tokens).
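Solving C·(1−α)^(d/W) = tol for d gives a closed-form horizon; a sketch assuming C = 1 and a 0.1% tolerance:

```python
import math

# Assumptions: C = 1 and tolerance 0.1%; both are illustrative choices.
def horizon(alpha: float, W: int, tol: float = 1e-3, C: float = 1.0) -> float:
    """Distance d where C * (1 - alpha) ** (d / W) falls to tol."""
    return W * math.log(tol / C) / math.log(1 - alpha)

print(horizon(0.95, 100))   # ≈ 231 tokens; rounding up to whole hops gives 3 windows
```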
6) If you need longer reach
- Increase W (costlier).
- Soften the residual (lower effective α).
- Add non-local routes: global/memory tokens, retrieval, sparse long-range heads (mask sketch below).
- Use persistent state models (state space models) to carry information without exponential loss.
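As a concrete illustration of one non-local route, here is a hypothetical boolean attention mask that adds G always-visible "global/memory" tokens at the start of the sequence on top of a causal sliding window. This is a sketch of the idea only, not any particular model's implementation:

```python
import numpy as np

def swa_mask_with_globals(T: int, W: int, G: int) -> np.ndarray:
    """True = query i may attend to key j: a causal window of width W, plus
    G assumed global tokens at positions 0..G-1 that never scroll away."""
    i = np.arange(T)[:, None]            # query positions
    j = np.arange(T)[None, :]            # key positions
    window = (j <= i) & (j > i - W)      # ordinary causal sliding window
    globals_ = (j < G) & (j <= i)        # the first G tokens stay visible
    return window | globals_

print(swa_mask_with_globals(T=10, W=3, G=2).astype(int))
```

Anything sitting in those first G positions is one hop away from every later token, so reading it does not pay the (1−α)^k toll.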
Pitfalls
- Confusing reach with influence; assuming depth alone solves long range.
- Ignoring α — when α is high, distant info drops exponentially.
- Forgetting tolerance — “how small is too small” is task dependent.
Links
- https://guangxuanx.com/blog/stacking-swa.html