
I Moved My Image Preprocessing to GPU. Here's What Actually Happened.

Everyone says GPU preprocessing is faster. I benchmarked it with a Ryzen 9 9950X3D and an RTX PRO 6000 Blackwell. The results surprised me.

The Conventional Wisdom

In deep learning with images, there's a common complaint: "My GPU is sitting idle waiting for data." The standard advice is to move preprocessing from CPU to GPU. Decode images on GPU. Run augmentations on GPU. Keep everything on the device.

I've been doing this for a while in my medical imaging work, using Kornia for GPU augmentations. But I'd never actually measured the difference on my current hardware. So I ran the benchmarks.

The Setup

My test rig is admittedly not typical:

CPU: AMD Ryzen 9 9950X3D (16C/32T, 128MB L3 with 3D V-Cache)
GPU: NVIDIA RTX PRO 6000 Blackwell

I generated 1000 synthetic JPEG images at 1024×1024, then benchmarked various preprocessing pipelines that resize to 224×224 with standard augmentations (flip, rotation, color jitter, normalization).

The Results

Pipeline                           Throughput    vs CPU
Batched Kornia (GPU decode)        1273 img/s    1.33×
torchvision v2 (GPU decode)         989 img/s    1.04×
CPU (8 workers)                     955 img/s    1.00× (baseline)
Kornia (GPU decode + transform)     456 img/s    0.48×
torchvision v2 (CPU decode)         333 img/s    0.35×
Kornia (CPU decode)                 237 img/s    0.25×

Wait. The CPU with 8 workers is competitive with most GPU pipelines?

The Surprise

My Ryzen 9 9950X3D is a monster. Its 3D V-Cache (128MB L3) makes it exceptionally good at image processing workloads. With 8 parallel DataLoader workers, it pushes 955 images/second; the fastest GPU pipeline beats it by only 1.33×.

Even more surprising: GPU preprocessing with CPU decoding is actually slower than pure CPU. The overhead of transferring decoded images to GPU, running transforms, and the lack of batching makes it worse than just doing everything on a fast CPU.

Key Insight

The only configuration that clearly beats CPU is batched GPU processing with GPU JPEG decoding (nvJPEG). Both ingredients matter: GPU decode eliminates the transfer bottleneck, and batching maximizes GPU parallelism.

What About Actual Training?

Raw preprocessing speed isn't the whole story. What matters is end-to-end training throughput. I added a ResNet-50 forward+backward pass (21ms on my GPU) to see the real impact:

18% faster effective training throughput with GPU preprocessing

Pipeline               Prep Time   Total Time   % Preprocessing
Batched Kornia (GPU)   25ms        46ms         54%
CPU (8 workers)        33ms        55ms         61%

Even with a top-tier CPU, preprocessing still accounts for more than half of each training step. The GPU pipeline trains 18% faster—not because preprocessing is dramatically faster, but because it overlaps better with model computation.

The Real Takeaway

On high-end CPUs: The gains from GPU preprocessing are modest (1.33× at best). If you have a Ryzen 9 or Threadripper, your CPU DataLoader might be "good enough."

On typical hardware: Most people don't have a 9950X3D. On a laptop CPU or older desktop, the GPU advantage would be much larger—potentially 3-5× or more.

The critical ingredients:

  1. GPU JPEG decoding — Without this, you're still bottlenecked on CPU→GPU transfer
  2. Batched transforms — Process the whole batch at once, not image-by-image
  3. Keep data on GPU — Once decoded, never move it back to CPU

The Code

Here's a minimal GPU preprocessing pipeline using Kornia:

import torch
import kornia.augmentation as K
from torchvision.io import decode_jpeg

# GPU transform pipeline
transform = torch.nn.Sequential(
    K.Resize((224, 224)),
    K.RandomHorizontalFlip(p=0.5),
    K.RandomRotation(degrees=15.0),
    K.ColorJitter(0.2, 0.2, 0.2, 0.1),
    K.Normalize(
        mean=torch.tensor([0.485, 0.456, 0.406]),
        std=torch.tensor([0.229, 0.224, 0.225])
    ),
).cuda()

# In your data loading:
def load_and_preprocess(jpeg_bytes):
    # jpeg_bytes: 1-D uint8 tensor of the encoded file
    # (e.g. from torchvision.io.read_file), as decode_jpeg requires
    # Decode directly on GPU (requires nvJPEG)
    img = decode_jpeg(jpeg_bytes, device='cuda')
    img = img.float() / 255.0
    img = img.unsqueeze(0)  # Add batch dim
    img = transform(img)
    return img.squeeze(0)

For maximum throughput, batch multiple images before applying transforms—Kornia handles batches natively.

When to Bother

Use GPU preprocessing when:

  1. Your GPU sits mostly idle waiting for the DataLoader
  2. You can use GPU JPEG decoding (nvJPEG) and batched transforms
  3. You're on a laptop CPU or an older desktop chip

Don't bother when:

  1. A fast CPU DataLoader already keeps your GPU fully utilized
  2. You can't batch your transforms or decode on GPU, since unbatched GPU preprocessing benchmarked slower than pure CPU

Final Thoughts

I expected GPU preprocessing to be a clear win. The reality is more nuanced: it depends heavily on your CPU. With top-tier silicon, the CPU holds its own. With typical hardware, GPU preprocessing is a significant optimization.

The 18% training speedup I measured is real and compounds over long training runs. But if you're already getting good GPU utilization with CPU DataLoaders, don't feel obligated to rewrite your pipeline.

Measure first. Optimize where it matters.

Resources

Full benchmark notebook: github.com/isztld/gpu-preprocessing-benchmark

Hardware:
CPU: AMD Ryzen 9 9950X3D (16C/32T)
GPU: NVIDIA RTX PRO 6000 Blackwell (102GB)
PyTorch: 2.8.0+cu129
Kornia: 0.8.1