What is the reason for the speedup of Transformer-XL?

Question

Transformer-XL is faster at inference than the standard Transformer.

Why?

If state reuse is the reason, is the comparison between two segments of seq_len 32 with state reuse and one segment of seq_len 64 without state reuse?

No correct solution

Licensed under: CC-BY-SA with attribution