I Moved My Image Preprocessing to GPU. Here's What Actually Happened.
Everyone says GPU preprocessing is faster. I benchmarked it with a Ryzen 9 9950X3D and an RTX PRO 6000 Blackwell. The results surprised me.
The Conventional Wisdom
In deep learning with images, there's a common complaint: "My GPU is sitting idle waiting for data." The standard advice is to move preprocessing from CPU to GPU. Decode images on GPU. Run augmentations on GPU. Keep everything on the device.
I've been doing this for a while in my medical imaging work, using Kornia for GPU augmentations. But I'd never actually measured the difference on my current hardware. So I ran the benchmarks.
The Setup
My test rig is admittedly not typical:
- CPU: AMD Ryzen 9 9950X3D — 16 cores, 32 threads, 128MB of L3 cache
- GPU: NVIDIA RTX PRO 6000 Blackwell — 96GB VRAM
- RAM: 128GB DDR5
I generated 1000 synthetic JPEG images at 1024×1024, then benchmarked various preprocessing pipelines that resize to 224×224 with standard augmentations (flip, rotation, color jitter, normalization).
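For reference, generating the synthetic inputs takes only a few lines of Pillow. This is a sketch, not the exact benchmark script: the helper name, file naming, and random-noise content are my assumptions.

```python
import numpy as np
from pathlib import Path
from PIL import Image

def make_synthetic_jpegs(out_dir: str, n: int = 1000, size: int = 1024) -> None:
    # Random RGB noise saved as JPEG; enough to exercise the decode + resize paths
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rng = np.random.default_rng(0)
    for i in range(n):
        pixels = rng.integers(0, 256, (size, size, 3), dtype=np.uint8)
        Image.fromarray(pixels).save(out / f"img_{i:04d}.jpg", quality=90)
```

One caveat: random noise compresses poorly, so these files are close to a worst case for JPEG decode; real photos would decode somewhat faster, but the relative rankings below shouldn't change much.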
The Results
| Pipeline | Throughput | vs CPU |
|---|---|---|
| Batched Kornia (GPU decode) | 1273 img/s | 1.33× |
| torchvision v2 (GPU decode) | 989 img/s | 1.04× |
| CPU (8 workers) | 955 img/s | 1.00× (baseline) |
| Kornia (GPU decode + transform) | 456 img/s | 0.48× |
| torchvision v2 (CPU decode) | 333 img/s | 0.35× |
| Kornia (CPU decode) | 237 img/s | 0.25× |
Wait. The CPU with 8 workers is competitive with most GPU pipelines?
The Surprise
My Ryzen 9 9950X3D is a monster. Its 3D V-Cache (128MB of L3) makes it exceptionally good at image processing workloads. With 8 parallel DataLoader workers, it pushes 955 images/second; the fastest GPU pipeline beats it by only 33% (1.33×).
Even more surprising: GPU preprocessing with CPU decoding is actually slower than pure CPU. The cost of transferring decoded images to the GPU, combined with running transforms image-by-image instead of in batches, makes it worse than just doing everything on a fast CPU.
The only configuration that clearly beats CPU is batched GPU processing with GPU JPEG decoding (nvJPEG). Both ingredients matter: GPU decode eliminates the transfer bottleneck, and batching maximizes GPU parallelism.
What About Actual Training?
Raw preprocessing speed isn't the whole story. What matters is end-to-end training throughput. I added a ResNet-50 forward+backward pass (21ms on my GPU) to see the real impact:
| Pipeline | Prep Time | Total Time | % Preprocessing |
|---|---|---|---|
| Batched Kornia (GPU) | 25ms | 46ms | 54% |
| CPU (8 workers) | 33ms | 55ms | 60% |
Even with a top-tier CPU, preprocessing still dominates training time. The GPU pipeline trains 18% faster, not because preprocessing is dramatically quicker, but because it overlaps better with model computation.
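That overlap comes from standard CUDA mechanics rather than anything Kornia-specific: pinned host memory lets the copy engine move data while the compute engine is busy. A minimal sketch (falling back to CPU when CUDA is absent, in which case the flags are no-ops):

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Pinned (page-locked) host memory allows truly asynchronous
# host-to-device copies that overlap with GPU compute.
batch = torch.randn(32, 3, 224, 224)
if device == 'cuda':
    batch = batch.pin_memory()
# non_blocking=True is only genuinely async for pinned-host -> GPU copies
batch = batch.to(device, non_blocking=True)
```

With a standard DataLoader you get the same effect from `pin_memory=True`; the batched Kornia pipeline sidesteps the copy entirely because decoded data never leaves the device.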
The Real Takeaway
On high-end CPUs: The gains from GPU preprocessing are modest (1.33× at best). If you have a Ryzen 9 or Threadripper, your CPU DataLoader might be "good enough."
On typical hardware: Most people don't have a 9950X3D. On a laptop CPU or older desktop, the GPU advantage would be much larger—potentially 3-5× or more.
The critical ingredients:
- GPU JPEG decoding — Without it, you still pay for CPU decode plus the CPU→GPU transfer of full decoded images
- Batched transforms — Process the whole batch at once, not image-by-image
- Keep data on GPU — Once decoded, never move it back to CPU
The Code
Here's a minimal GPU preprocessing pipeline using Kornia:
```python
import torch
import kornia.augmentation as K
from torchvision.io import decode_jpeg

# GPU transform pipeline
transform = torch.nn.Sequential(
    K.Resize((224, 224)),
    K.RandomHorizontalFlip(p=0.5),
    K.RandomRotation(degrees=15.0),
    K.ColorJitter(0.2, 0.2, 0.2, 0.1),
    K.Normalize(
        mean=torch.tensor([0.485, 0.456, 0.406]),
        std=torch.tensor([0.229, 0.224, 0.225]),
    ),
).cuda()

# In your data loading:
def load_and_preprocess(jpeg_bytes: bytes) -> torch.Tensor:
    # decode_jpeg expects a 1-D uint8 tensor of the encoded bytes, not raw bytes
    encoded = torch.frombuffer(bytearray(jpeg_bytes), dtype=torch.uint8)
    # Decode directly on GPU (requires nvJPEG)
    img = decode_jpeg(encoded, device='cuda')
    img = img.float() / 255.0
    img = img.unsqueeze(0)  # Add batch dim
    img = transform(img)
    return img.squeeze(0)
```
For maximum throughput, batch multiple images before applying transforms—Kornia handles batches natively.
When to Bother
Use GPU preprocessing when:
- Your GPU utilization is below 80% during training
- You're on a laptop or mid-range desktop CPU
- You have GPU JPEG decode support (most modern NVIDIA GPUs)
- You're training models where data loading is the bottleneck
Don't bother when:
- You have a high-end CPU (Ryzen 9, Threadripper, Xeon)
- Your model is slow enough that data loading isn't the bottleneck
- You're doing inference (no augmentation needed)
- You don't have nvJPEG support
Final Thoughts
I expected GPU preprocessing to be a clear win. The reality is more nuanced: it depends heavily on your CPU. With top-tier silicon, the CPU holds its own. With typical hardware, GPU preprocessing is a significant optimization.
The 18% training speedup I measured is real and compounds over long training runs. But if you're already getting good GPU utilization with CPU DataLoaders, don't feel obligated to rewrite your pipeline.
Measure first. Optimize where it matters.
Full benchmark notebook: github.com/isztld/gpu-preprocessing-benchmark
Hardware:
CPU: AMD Ryzen 9 9950X3D (16C/32T)
GPU: NVIDIA RTX PRO 6000 Blackwell (96GB)
PyTorch: 2.8.0+cu129
Kornia: 0.8.1