tagged “Triton”
1 entry
- 01 A mildly cursed 3.5× triton tl.reduce optimization
  TL;DR: For small, compile-time K, manually unrolling a 3D→2D bitwise-OR reduction can beat tl.reduce by ~3.5×. I’ll take you on a small adventure of some …
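To make the TL;DR concrete, here is a minimal plain-Python sketch of what "manually unrolling a 3D→2D bitwise-OR reduction" means. It is not the post's Triton kernel and says nothing about performance: it only shows that a generic reduction over the last axis and a hand-unrolled OR over a small fixed K give the same result. The names `K`, `or_reduce_generic`, and `or_reduce_unrolled` are hypothetical, and K = 4 is an assumed value chosen for illustration.

```python
from functools import reduce
from operator import or_
import random

K = 4  # assumed small, fixed reduction length (stands in for a compile-time constant)


def or_reduce_generic(x):
    """Generic path: fold bitwise OR over the last axis of an M x N x K nested list."""
    return [[reduce(or_, lane) for lane in row] for row in x]


def or_reduce_unrolled(x):
    """Unrolled path: the K == 4 OR chain is written out by hand, with no loop over K."""
    return [[lane[0] | lane[1] | lane[2] | lane[3] for lane in row] for row in x]


# Build a random M x N x K input and check that both paths agree.
random.seed(0)
M, N = 8, 16
x = [[[random.getrandbits(16) for _ in range(K)] for _ in range(N)] for _ in range(M)]
assert or_reduce_generic(x) == or_reduce_unrolled(x)
```

In a Triton kernel, the analogous choice would be between calling tl.reduce with an OR combine function and OR-ing K separately loaded 2D tiles. The ~3.5× figure is the post's own measurement and is not reproduced by this sketch.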