TurboQuant Demystified: Online Vector Quantization with Near-optimal Distortion Rate
Google Research just published TurboQuant, an online vector quantizer that compresses the KV cache to 3.5 bits with zero quality loss, and beats data-dependent product quantization in nearest neighbor recall—with an indexing time of essentially zero. The secret is a three-step trick: rotate, quantize per coordinate, and fix the inner product bias with a 1-bit residual.
- The Problem: What Are We Quantizing and Why?
- Step 1 — The Random Rotation Trick
- Step 2 — The Beta Distribution on the Sphere
- Step 3 — Optimal Scalar Quantization via Lloyd-Max
- TurboQuantmse: Putting It Together
- The Bias Problem: Why MSE-Optimal ≠ Inner Product Optimal
- QJL: A 1-bit Unbiased Inner Product Quantizer
- TurboQuantprod: The Two-Stage Fix
- Lower Bounds: How Close to Optimal?
- Experiments: KV Cache and Nearest Neighbors
- Conclusion
The Problem: What Are We Quantizing and Why?
Vector quantization is the problem of compressing a $d$-dimensional floating-point vector $\mathbf{x} \in \mathbb{R}^d$ into $B$ bits, while keeping the reconstructed vector $\tilde{\mathbf{x}}$ as close as possible to the original. The map $Q: \mathbb{R}^d \to \{0,1\}^B$ and its inverse $Q^{-1}: \{0,1\}^B \to \mathbb{R}^d$ must be fast—fast enough to run on every token in a live LLM inference stream.
There are two distortion measures we care about. The first is mean-squared error:
$$D_{\text{mse}} := \mathbb{E}_Q\left[\left\| \mathbf{x} - Q^{-1}(Q(\mathbf{x})) \right\|_2^2 \right]$$The second, and harder, is inner product distortion. For any query $\mathbf{y}$, we want the inner product $\langle \mathbf{y}, \tilde{\mathbf{x}} \rangle$ to be close to $\langle \mathbf{y}, \mathbf{x} \rangle$:
$$D_{\text{prod}} := \mathbb{E}_Q\left[\left| \langle \mathbf{y}, \mathbf{x} \rangle - \langle \mathbf{y}, Q^{-1}(Q(\mathbf{x})) \rangle \right|^2 \right]$$And for inner products we also want unbiasedness: the estimate should be right on average, not just close in variance.
MSE matters for KV value reconstruction—you'll multiply attention weights by the dequantized values and sum them. Inner product matters for KV key similarity—softmax scores are dot products between query and (dequantized) keys. Getting the dot products wrong means you attend to the wrong tokens.
The tension in this field is between two unsatisfactory camps. Data-dependent methods (product quantization, GPTQ, AWQ) run k-means or Hessian-based optimization on your dataset and get great distortion—but they can't run on a live KV cache because you don't know the data in advance. Data-oblivious methods are fast and online but historically give up a lot in distortion quality, especially at low bit-widths.
TurboQuant claims to be in neither camp. It's fully data-oblivious, runs per-vector with no calibration, and achieves distortion within a factor of $\approx 2.7$ of what Shannon says is theoretically possible for any quantizer. Let's see how.
Step 1 — The Random Rotation Trick
The first move TurboQuant makes is to multiply the input vector by a random orthogonal matrix $\boldsymbol{\Pi} \in \mathbb{R}^{d \times d}$, generated by QR-decomposing a matrix of i.i.d. Gaussians:
$$\mathbf{y} = \boldsymbol{\Pi} \cdot \mathbf{x}$$Why? Because $\boldsymbol{\Pi}$ is a rotation, $\|\mathbf{y}\|_2 = \|\mathbf{x}\|_2$—no information is lost. But the key consequence is geometric: for a unit-norm $\mathbf{x}$ (the norm can be stored separately and restored at dequantization), even a worst-case adversarially chosen input becomes, after the random rotation, a vector $\mathbf{y}$ uniformly distributed on the unit sphere—regardless of what $\mathbf{x}$ was.
This is the same trick that QuIP# and QuaRot use to neutralize outlier channels in weight quantization. The rotation doesn't change what the model computes (it's invertible and we rotate back after dequantization), but it turns the hardest possible input into a generic one.
After a random rotation, every coordinate of $\mathbf{y}$ has the same statistical distribution—regardless of the structure of the original $\mathbf{x}$. This means we can design a single optimal quantizer and apply it independently to every coordinate.
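A minimal numpy sketch of this step (variable names are mine, not the paper's). Note the sign correction by $\text{diag}(\mathbf{R})$: raw QR output is orthogonal but not uniformly distributed over the orthogonal group.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# Haar-random orthogonal matrix via QR of a Gaussian matrix.
# Multiplying columns by sign(diag(R)) makes the distribution exactly
# uniform (Haar) over the orthogonal group.
G = rng.standard_normal((d, d))
Q, R = np.linalg.qr(G)
Pi = Q * np.sign(np.diag(R))

x = np.zeros(d)
x[0] = 1.0                # adversarial input: all energy in one coordinate
y = Pi @ x                # rotated to "generic position"

print(np.linalg.norm(y))  # norm preserved (isometry): ~1.0
print(np.abs(y).max())    # energy spread out: max |y_j| is O(sqrt(log d / d))
```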
Now, what is that distribution exactly? This is where the Beta distribution enters.
Step 2 — The Beta Distribution on the Sphere
Take a vector uniformly distributed on the unit sphere $\mathbb{S}^{d-1}$. What does any single coordinate look like? The paper derives the answer directly from geometry: the density of the $j$-th coordinate $y_j$ is
$$f_X(x) = \frac{\Gamma(d/2)}{\sqrt{\pi} \cdot \Gamma((d-1)/2)} \left(1 - x^2\right)^{(d-3)/2}, \quad x \in [-1, 1]$$This is a scaled Beta distribution on $[-1, 1]$. The intuition: the probability of a coordinate taking value $x$ is proportional to the cross-section of the sphere at height $x$—a $(d-2)$-sphere of radius $\sqrt{1-x^2}$, with surface area proportional to $(1-x^2)^{(d-2)/2}$—divided by the $\sqrt{1-x^2}$ slant factor from the change of variables along the sphere's surface, giving the $(d-3)/2$ exponent.
In high dimensions ($d \to \infty$), this Beta distribution converges to $\mathcal{N}(0, 1/d)$. The distribution becomes concentrated: most of the mass is near zero, in a window of width $\sim 1/\sqrt{d}$. The figure below shows this convergence as $d$ grows.
As $d$ increases, the Beta distribution converges to $\mathcal{N}(0, 1/d)$ (dashed). At $d \approx 64$ the distributions are nearly indistinguishable. This means the same precomputed optimal quantizer works for all large $d$.
The second crucial fact is near-independence. In high dimensions, two distinct coordinates $y_i$ and $y_j$ of a random unit vector are nearly independent—not just uncorrelated (which would follow from symmetry) but genuinely close to independent in the information-theoretic sense. This means we can apply independent scalar quantizers to each coordinate without worrying about cross-coordinate interactions.
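Both facts are easy to check empirically. A small numpy sketch (sampling uniform sphere vectors as normalized Gaussians; names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 100_000

# Uniform samples on the sphere: normalize i.i.d. Gaussian vectors.
g = rng.standard_normal((n, d))
u = g / np.linalg.norm(g, axis=1, keepdims=True)

c0, c1 = u[:, 0], u[:, 1]         # two distinct coordinates
print(c0.var())                    # ~1/d = 0.015625, matching N(0, 1/d)
print(np.corrcoef(c0, c1)[0, 1])   # ~0: coordinates are nearly independent
```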
Step 3 — Optimal Scalar Quantization via Lloyd-Max
Now we have a known distribution—either the exact Beta or its Gaussian approximation—and we want to quantize samples from it optimally with $b$ bits, i.e., using $2^b$ levels. This is the classic 1D k-means problem: find $2^b$ centroids $c_1 \le c_2 \le \ldots \le c_{2^b}$ that minimize
$$\mathcal{C}(f_X, b) := \min_{c_1 \le \ldots \le c_{2^b}} \sum_{i=1}^{2^b} \int_{\frac{c_{i-1}+c_i}{2}}^{\frac{c_i+c_{i+1}}{2}} |x - c_i|^2 \cdot f_X(x)\, dx$$with the convention that the outermost limits $\tfrac{c_0+c_1}{2}$ and $\tfrac{c_{2^b}+c_{2^b+1}}{2}$ are the support endpoints $-1$ and $1$. This is solved by the Lloyd-Max algorithm: iteratively update boundaries to midpoints between consecutive centroids, and update centroids to the conditional mean of $f_X$ within each cell. It converges in a handful of iterations for smooth distributions like the Beta.
Crucially, this step is done once offline. The resulting codebooks are stored for each bit-width $b \in \{1, 2, 3, 4, \ldots\}$. At runtime, quantizing a vector is just: (1) compute $\mathbf{y} = \boldsymbol{\Pi} \cdot \mathbf{x}$, and (2) for each coordinate, look up the nearest centroid with a binary search. Fully vectorizable. No k-means. No calibration data.
For small bit-widths the optimal centroids have a clean closed form. At $b=1$, the optimal 2-level quantizer for $\mathcal{N}(0, 1/d)$ has centroids $\left\{\pm \sqrt{2/(\pi d)}\right\}$—the positive and negative means of a half-normal. At $b=2$, the four centroids are approximately $\left\{\pm 0.453/\sqrt{d},\, \pm 1.510/\sqrt{d}\right\}$.
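The Lloyd-Max iteration itself fits in a few lines. Here's a sketch for the standard normal (centroids for $\mathcal{N}(0,1/d)$ are these values scaled by $1/\sqrt{d}$); `lloyd_max_gaussian` is my name, not the paper's:

```python
import math

def lloyd_max_gaussian(bits: int, iters: int = 200) -> list:
    """Lloyd-Max centroids for N(0, 1) with 2**bits levels."""
    phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # pdf (0 at +/-inf)
    Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))         # cdf
    k = 2 ** bits
    c = [-2.0 + 4.0 * (i + 0.5) / k for i in range(k)]             # rough init
    for _ in range(iters):
        # boundaries: midpoints between consecutive centroids, +/-inf at the ends
        t = [-math.inf] + [(c[i] + c[i + 1]) / 2 for i in range(k - 1)] + [math.inf]
        # centroids: conditional mean of N(0,1) in each cell,
        # E[X | a < X < b] = (phi(a) - phi(b)) / (Phi(b) - Phi(a))
        c = [(phi(t[i]) - phi(t[i + 1])) / (Phi(t[i + 1]) - Phi(t[i]))
             for i in range(k)]
    return c

print(lloyd_max_gaussian(1))   # converges to +/- sqrt(2/pi) ~ +/- 0.7979
print(lloyd_max_gaussian(2))   # converges to approx +/- 0.453 and +/- 1.510
```

At $b=1$ the fixed point is reached after a single iteration; at $b=2$ the centroids converge to the values quoted above.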
Centroids (vertical lines) partition the distribution into $2^b$ Voronoi cells. The dashed boundaries are the midpoints between consecutive centroids. Notice how the cells are narrow near the peak and wider in the tails—exactly where the distribution has less mass.
For asymptotically large $b$, the Panter-Dite formula gives the optimal scalar quantizer MSE as:
$$\mathcal{C}(f_X, b) \le \frac{1}{12} \left(\int f_X(x)^{1/3}\, dx\right)^3 \cdot \frac{1}{4^b} = \frac{\sqrt{3}\pi}{2d} \cdot \frac{1}{4^b}$$The $1/4^b$ factor is the key—distortion halves every time you add one bit. This is the same exponential improvement you get from the Shannon distortion-rate function.
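As a quick check of the constant, plugging the Gaussian approximation $f_X = \mathcal{N}(0, \sigma^2)$ with $\sigma^2 = 1/d$ into the Panter-Dite integral recovers the stated $\sqrt{3}\pi/(2d)$:

$$\int f_X(x)^{1/3}\, dx = (2\pi\sigma^2)^{-1/6} \int e^{-x^2/(6\sigma^2)}\, dx = (2\pi\sigma^2)^{-1/6} \sqrt{6\pi\sigma^2}$$

$$\frac{1}{12}\left(\int f_X^{1/3}\, dx\right)^3 = \frac{1}{12} \cdot \frac{(6\pi\sigma^2)^{3/2}}{(2\pi\sigma^2)^{1/2}} = \frac{6\sqrt{3}\,\pi\sigma^2}{12} = \frac{\sqrt{3}\,\pi}{2d}$$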
TurboQuantmse: Putting It Together
The MSE-optimal TurboQuant algorithm is remarkably concise:
import torch
import numpy as np
class TurboQuantMSE:
"""
TurboQuant optimized for MSE distortion.
Args:
d: vector dimension
b: bits per coordinate (1, 2, 3, or 4)
codebook: precomputed Lloyd-Max centroids for Beta(d) distribution
"""
def __init__(self, d: int, b: int, codebook: torch.Tensor) -> None:
self.d = d
self.b = b
# Random rotation matrix (generated once, shared across calls)
        G = torch.randn(d, d)
        Q, R = torch.linalg.qr(G)
        # Sign-correct by diag(R) so Pi is Haar (uniformly) distributed over
        # the orthogonal group; raw QR output is orthogonal but not uniform.
        self.Pi = Q * torch.sign(torch.diagonal(R))  # [d, d] orthogonal matrix
# codebook: [2**b] centroids in sorted order
self.codebook = codebook # [2^b]
def quant(self, x: torch.Tensor) -> torch.Tensor:
"""
Args: x: [d] unit-norm vector
Returns idx: [d] integer indices in [0, 2^b)
"""
y = self.Pi @ x # [d] — rotate to generic position
# Find nearest centroid for each coordinate: broadcasted L1 search
# y[:, None] is [d, 1], codebook[None, :] is [1, 2^b] → distances [d, 2^b]
dists = (y[:, None] - self.codebook[None, :]).abs()
idx = dists.argmin(dim=-1) # [d] — b-bit integers
return idx
def dequant(self, idx: torch.Tensor) -> torch.Tensor:
"""
Args: idx: [d] integer indices in [0, 2^b)
Returns x_hat: [d] reconstructed vector
"""
y_hat = self.codebook[idx] # [d] — retrieve centroids
x_hat = self.Pi.T @ y_hat # [d] — rotate back
return x_hat
For any unit-norm $\mathbf{x} \in \mathbb{S}^{d-1}$, TurboQuantmse achieves:
$$D_{\text{mse}} \le \frac{\sqrt{3}\,\pi}{2} \cdot \frac{1}{4^b}$$This asymptotic bound is loose at small bit-widths; the exact Lloyd-Max distortions are tighter: $b=1 \Rightarrow D \approx 0.36$, $b=2 \Rightarrow D \approx 0.117$, $b=3 \Rightarrow D \approx 0.030$, $b=4 \Rightarrow D \approx 0.009$.
The proof is clean. Since $\boldsymbol{\Pi}$ is an isometry, $\|\mathbf{x} - \tilde{\mathbf{x}}\|_2 = \|\mathbf{y} - \tilde{\mathbf{y}}\|_2$. Since all coordinates of $\mathbf{y}$ are i.i.d. with distribution $f_X$, the total MSE is exactly $d$ times the per-coordinate quantization error $\mathcal{C}(f_X, b)$. The bound then follows from the Panter-Dite formula applied to $f_X$.
The Bias Problem: Why MSE-Optimal ≠ Inner Product Optimal
Here's where things get subtle. Minimizing MSE does not mean the inner product estimator $\langle \mathbf{y}, \tilde{\mathbf{x}} \rangle$ is unbiased. In fact, at $b=1$, there's a systematic multiplicative bias of $2/\pi \approx 0.637$.
Let's see why. At $b=1$ with a normal distribution, the optimal codebook is $\{-\sqrt{2/(\pi d)},\, +\sqrt{2/(\pi d)}\}$, so the quantizer just takes the sign: $Q_{\text{mse}}(\mathbf{x}) = \text{sign}(\boldsymbol{\Pi} \cdot \mathbf{x})$, and the dequantizer is $\tilde{\mathbf{x}} = \sqrt{2/(\pi d)} \cdot \boldsymbol{\Pi}^\top \cdot \text{sign}(\boldsymbol{\Pi} \cdot \mathbf{x})$. The expected inner product with any query $\mathbf{y}$ is:
$$\mathbb{E}\left[\langle \mathbf{y}, \tilde{\mathbf{x}} \rangle\right] = \frac{2}{\pi} \cdot \langle \mathbf{y}, \mathbf{x} \rangle$$The $2/\pi$ factor comes from the well-known identity for sign-random-projection estimators. This bias shrinks as $b$ increases—at $b=4$ it's nearly gone—but it's a fundamental issue at low bit-widths.
In the attention mechanism, softmax scores are computed from inner products of queries and keys. A systematic multiplicative bias on all key-query dot products shifts the distribution of attention weights. Tokens that should receive low attention get a small absolute boost; tokens with high scores are shrunk. Over many layers, this compounds.
Each point is a (true inner product, estimated inner product) pair for TurboQuantmse. A perfect unbiased estimator would lie on the diagonal. At low bits, points systematically fall below the diagonal—the bias. TurboQuantprod (shown in green) stays on the diagonal at all bit-widths.
QJL: A 1-bit Unbiased Inner Product Quantizer
To fix the bias, TurboQuant borrows a building block from an earlier paper: the Quantized Johnson-Lindenstrauss (QJL) transform. QJL is a 1-bit inner product quantizer that is provably unbiased.
Given a random matrix $\mathbf{S} \in \mathbb{R}^{d \times d}$ with i.i.d. $\mathcal{N}(0,1)$ entries, QJL is defined as:
$$Q_{\text{qjl}}(\mathbf{x}) := \text{sign}(\mathbf{S} \cdot \mathbf{x}) \in \{-1, +1\}^d$$ $$Q_{\text{qjl}}^{-1}(\mathbf{z}) := \frac{\sqrt{\pi/2}}{d} \cdot \mathbf{S}^\top \cdot \mathbf{z}$$The unbiasedness follows from a classical fact: for any two vectors $\mathbf{x}, \mathbf{y}$ and a random Gaussian vector $\mathbf{s}$,
$$\mathbb{E}\left[\mathbf{s}^\top \mathbf{y} \cdot \text{sign}(\mathbf{s}^\top \mathbf{x})\right] = \sqrt{2/\pi} \cdot \langle \mathbf{y}, \mathbf{x} \rangle$$The $\sqrt{\pi/2}$ prefactor in $Q_{\text{qjl}}^{-1}$ exactly cancels this, giving an unbiased estimator. The variance bound from the QJL paper is:
$$\text{Var}\left(\langle \mathbf{y}, Q_{\text{qjl}}^{-1}(Q_{\text{qjl}}(\mathbf{x})) \rangle\right) \le \frac{\pi}{2d} \cdot \|\mathbf{y}\|_2^2$$The variance is $O(1/d)$—it gets better with dimension. This is because we're averaging $d$ independent sign-projection estimators, one per row of $\mathbf{S}$, and their variance divides by $d$.
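A quick numpy check of QJL's unbiasedness (using $m$ rows of $\mathbf{S}$ instead of $d$, which only changes the variance; names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 50_000   # the scheme uses m = d rows; more rows just shrink the variance

x = rng.standard_normal(d); x /= np.linalg.norm(x)   # unit-norm data vector
y = rng.standard_normal(d); y /= np.linalg.norm(y)   # query vector

S = rng.standard_normal((m, d))
z = np.sign(S @ x)                                # 1-bit code: one sign per row of S
est = np.sqrt(np.pi / 2) / m * (y @ (S.T @ z))    # QJL inner product estimate

print(est, y @ x)   # agree to within a few multiples of sqrt(pi/(2m))
```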
TurboQuantprod: The Two-Stage Fix
The inner-product-optimal TurboQuant combines both ingredients. Given a budget of $b$ bits per coordinate:
- Run TurboQuantmse with bit-width $b-1$ on $\mathbf{x}$. Store the indices idx.
- Compute the residual $\mathbf{r} = \mathbf{x} - Q_{\text{mse}}^{-1}(Q_{\text{mse}}(\mathbf{x}))$.
- Apply QJL to the residual: qjl = sign(S · r). Also store $\|\mathbf{r}\|_2$.
The total bit cost is $(b-1)$ bits for the MSE indices plus $1$ bit per coordinate for the QJL signs—exactly $b$ bits per coordinate. The dequantized vector is:
$$\tilde{\mathbf{x}} = Q_{\text{mse}}^{-1}(\texttt{idx}) + \|\mathbf{r}\|_2 \cdot Q_{\text{qjl}}^{-1}(\texttt{qjl})$$Why does this work? The MSE stage produces a coarse reconstruction with a small residual $\mathbf{r}$. The QJL stage provides an unbiased correction for that residual. Since the QJL estimator is unbiased, the total estimator is unbiased:
$$\mathbb{E}[\langle \mathbf{y}, \tilde{\mathbf{x}} \rangle] = \langle \mathbf{y}, Q_{\text{mse}}^{-1}(\texttt{idx}) \rangle + \langle \mathbf{y}, \mathbf{r} \rangle = \langle \mathbf{y}, \mathbf{x} \rangle$$
class TurboQuantProd:
"""
TurboQuant optimized for inner product distortion.
Two-stage: (b-1)-bit MSE quantizer + 1-bit QJL on the residual.
"""
def __init__(self, d: int, b: int, codebook: torch.Tensor) -> None:
self.mse = TurboQuantMSE(d, b - 1, codebook) # one bit fewer
# Random Gaussian matrix for QJL
self.S = torch.randn(d, d) # [d, d] i.i.d. N(0,1)
def quant(self, x: torch.Tensor) -> tuple:
"""
Args: x: [d] unit-norm vector
Returns (idx, qjl, gamma):
idx: [d] MSE indices (b-1 bits per coord)
qjl: [d] ±1 signs (1 bit per coord)
gamma: scalar residual norm
"""
idx = self.mse.quant(x) # [(b-1)-bit MSE indices]
r = x - self.mse.dequant(idx) # [d] residual vector
gamma = r.norm() # scalar — residual norm
qjl = (self.S @ r).sign() # [d] ±1 signs
return idx, qjl, gamma
def dequant(self, idx, qjl, gamma) -> torch.Tensor:
x_mse = self.mse.dequant(idx) # [d] coarse reconstruction
        # QJL inverse: scale S^T @ signs by sqrt(pi/2) / d, then multiply by
        # the residual norm gamma (QJL was applied to the unit-scaled residual)
x_qjl = (np.sqrt(np.pi / 2) / self.S.shape[0]) * gamma * (self.S.T @ qjl)
return x_mse + x_qjl # [d] combined reconstruction
For any unit-norm $\mathbf{x}$ and any $\mathbf{y}$, TurboQuantprod achieves:
- Unbiased: $\mathbb{E}[\langle \mathbf{y}, \tilde{\mathbf{x}} \rangle] = \langle \mathbf{y}, \mathbf{x} \rangle$
- Distortion: $D_{\text{prod}} \le \dfrac{\sqrt{3}\,\pi^2\, \|\mathbf{y}\|_2^2}{d} \cdot \dfrac{1}{4^b}$
At $b=1,2,3,4$ (using the exact Lloyd-Max distortions for the first stage): $D_{\text{prod}} \approx 1.57/d,\ 0.56/d,\ 0.18/d,\ 0.047/d$. At $b=1$ the MSE stage gets zero bits, so the quantizer is pure QJL with variance $\pi/(2d) \approx 1.57/d$.
The distortion proof uses the tower property of conditional expectation and the QJL variance bound from above. The key step is conditioning on the MSE output and observing that the remaining error is entirely due to the QJL variance on the residual $\mathbf{r}$, scaled by $\|\mathbf{r}\|_2^2 = D_{\text{mse}}(b-1)$.
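The whole pipeline is small enough to check end-to-end. A self-contained sketch at $b=2$ (1-bit sign quantizer as the MSE stage, 1-bit QJL on the residual), averaging over fresh draws of $\boldsymbol{\Pi}$ and $\mathbf{S}$ to verify unbiasedness; helper names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 2000

def haar_rotation(d, rng):
    # QR with sign correction -> uniform over the orthogonal group
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)

c = np.sqrt(2 / (np.pi * d))                 # 1-bit Lloyd-Max centroid for N(0, 1/d)
ests = []
for _ in range(T):
    Pi = haar_rotation(d, rng)
    S = rng.standard_normal((d, d))
    x_mse = Pi.T @ (c * np.sign(Pi @ x))     # stage 1: 1-bit MSE reconstruction
    r = x - x_mse                            # residual
    gamma = np.linalg.norm(r)
    z = np.sign(S @ r)                       # stage 2: QJL signs on the residual
    x_qjl = np.sqrt(np.pi / 2) / d * gamma * (S.T @ z)
    ests.append(y @ (x_mse + x_qjl))

print(np.mean(ests), y @ x)   # should agree to ~1e-2: the estimator is unbiased
```

Averaging over trials estimates $\mathbb{E}[\langle \mathbf{y}, \tilde{\mathbf{x}} \rangle]$; the per-trial spread is the $D_{\text{prod}}$ variance, which shrinks with $d$.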
The two-stage pipeline for a $b$-bit inner product quantizer. Stage 1 handles the bulk of the MSE reconstruction at $b-1$ bits. Stage 2 uses the remaining 1-bit budget to correct inner product bias on the residual.
Lower Bounds: How Close to Optimal?
Near-optimality claims need a lower bound to be meaningful. TurboQuant proves one using Yao's minimax principle and Shannon's Rate-Distortion theory.
Yao's principle converts the question from "how good is the best randomized algorithm on worst-case inputs?" to "how good is the best deterministic algorithm on the hardest random input distribution?" The hardest distribution for MSE on the unit sphere is the uniform distribution—so the problem reduces to: what's the Shannon distortion-rate bound for a uniform distribution on $\mathbb{S}^{d-1}$?
The Shannon Lower Bound (SLB) says: for any distribution with differential entropy $h(\mathbf{x})$ and bit budget $B$,
$$D(B) \ge \frac{d}{2\pi e} \cdot 2^{(2/d)(h(\mathbf{x}) - B)}$$Plugging in the entropy of the uniform distribution on $\mathbb{S}^{d-1}$ and simplifying with Stirling's approximation gives:
$$D_{\text{mse}}(Q) \ge \frac{1}{4^b}, \qquad D_{\text{prod}}(Q) \ge \frac{1}{d} \cdot \frac{1}{4^b}$$The lower bound is $1/4^b$. TurboQuantmse achieves $(\sqrt{3}\pi/2)/4^b$. The ratio $\sqrt{3}\pi/2 \approx 2.72$ is the gap—and it's a universal constant, independent of $b$, $d$, or the input. At $b=1$ the gap is even smaller: TurboQuant is only $1.45\times$ away from optimal.
Both axes log-scale. TurboQuantmse (red) and TurboQuantprod (green) track the lower bound (dashed) with a constant vertical offset: a factor of $\approx 2.7$ in distortion, equivalent to $\approx 0.72$ extra bits per coordinate (each extra bit divides distortion by 4). Existing methods like scalar quantization without rotation show a polynomial rather than exponential improvement.
Experiments: KV Cache and Nearest Neighbors
KV Cache Quantization
The paper evaluates TurboQuant on the Needle-in-a-Haystack benchmark with Llama-3.1-8B-Instruct and sequences from 4k to 104k tokens, at a $4\times$ compression ratio. TurboQuant achieves a recall score of 0.997, identical to the uncompressed baseline (also 0.997). By comparison: KIVI scores 0.981, PyramidKV 0.895, SnapKV 0.858.
On LongBench-E with 3.5-bit quantization, TurboQuant matches the full-precision model's average score (50.06 vs. 50.06) across six categories: single-doc QA, multi-doc QA, summarization, few-shot, synthetic, and code.
One implementation detail worth noting: the paper uses a non-integer bit-width of 3.5 bits by identifying a fixed number of "outlier" channels (those with consistently large magnitude) and allocating 4 bits to them, 3 bits to the rest. This is a standard trick across quantization literature. At 2.5-bit quantization TurboQuant still achieves 49.44 average—within 1.2% of full precision, at $6.4\times$ compression.
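The arithmetic behind the fractional bit-width is straightforward; a tiny sketch (`avg_bits` is a hypothetical helper, not from the paper):

```python
def avg_bits(d: int, n_outliers: int, hi: int = 4, lo: int = 3) -> float:
    """Average bits/coordinate when n_outliers channels get hi bits, the rest lo."""
    return (hi * n_outliers + lo * (d - n_outliers)) / d

d = 128
print(avg_bits(d, d // 2))        # half the channels at 4 bits -> 3.5 bits/coord
print(16 / avg_bits(d, d // 2))   # compression vs. an fp16 baseline -> ~4.57x
```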
Nearest Neighbor Search
TurboQuant is evaluated on DBpedia (1536-d and 3072-d OpenAI embeddings) and GloVe (200-d). The comparison is against Product Quantization with AVX2-optimized LUT256 lookup tables, and RabitQ.
The headline result is the indexing time. PQ requires running k-means to build codebooks. At $d=1536$ that takes 239 seconds. RabitQ takes 2267 seconds. TurboQuant: 0.0013 seconds. That's five orders of magnitude.
Product Quantization (4-bit, d=1536)
Indexing time: 239.75s
Requires calibration data to build codebook. LUT256 with 256 codewords per sub-space. Benefits from dataset-specific tuning.
TurboQuant (4-bit, d=1536)
Indexing time: 0.0013s
Fully online. No calibration. Uses precomputed Lloyd-Max codebook for Beta distribution. Vectorized on any GPU.
Despite PQ having a homefield advantage (same data for train and test), TurboQuant achieves higher recall across all three datasets and bit-widths tested. The improvement is most pronounced at low bits ($b=2$), where PQ's codebook—optimized for the full dataset—struggles with out-of-distribution vectors, while TurboQuant's data-agnostic rotation provides robust coverage.
Recall@1 on 100k DBpedia vectors with 1k query points. TurboQuant (green) outperforms PQ (orange) at all bit-widths. RabitQ (purple) comes close at 4-bit but at 3000× the indexing cost.
Conclusion
TurboQuant's contribution is simple to state: a fully online, data-oblivious vector quantizer that achieves near-optimal distortion in both MSE and inner product, with a proof. The algorithm has three moving pieces:
- Random rotation: neutralize the worst-case input, induce a known coordinate distribution.
- Lloyd-Max scalar quantization: apply precomputed optimal codebooks independently to each coordinate.
- QJL residual correction: one extra bit per coordinate removes the inner product bias introduced by the MSE quantizer.
The gap to the information-theoretic lower bound is $\approx 2.7\times$—a small constant that holds for all bit-widths and all input vectors. No other online quantizer achieves this.
The practical implications are real. KV cache at 3.5 bits is quality-neutral in LongBench. Nearest neighbor indexing is effectively free. The algorithm is fully vectorizable—the rotation is a single matrix multiply, the codebook lookup is an argmin over $2^b$ values, the QJL is a sign of a matrix multiply. Everything runs on GPU with no serial bottlenecks.
The open question is whether the $2.7\times$ constant is tight or whether an even tighter construction exists. Shannon's bound comes from a backward Gaussian channel argument that may not be achievable with practical algorithms. That gap is where the next paper will live.