Technical deep-dives on distributed systems, GPU computing, and ML infrastructure. I write about things I wish someone had explained to me.
Recent Posts
-
2026-02-07
Online Softmax Demystified: From PyTorch to FlashAttention in Triton
The trick that makes FlashAttention possible. Step by step from PyTorch to Triton, with no magic operators.
-
2026-01-30
Attention Mechanism: From Math to GPU
A deep dive into attention: the math, three implementations, and how it maps to GPU memory hierarchies. With interactive visualizations.
-
2026-01-21
I Moved My Image Preprocessing to GPU. Here's What Actually Happened.
GPU preprocessing is supposed to be faster. On a Ryzen 9 9950X3D versus an RTX PRO 6000, the results surprised me.
-
2026-01-16
Why AI Can't Change Its Mind (And Why That's a Problem)
Current AI models follow a single reasoning path and can't recover from mistakes. What if the solution is making them more human?
-
2026-01-14
PyTorch: The Quiet Revolution in Python Numerical Computing
I replaced NumPy with PyTorch. FFT is now 1700× faster. Here's what nobody is talking about.