Architecture

| Aspect | CPU | GPU |
|---|---|---|
| Cores | 4–128 complex cores | 1 000s of small ALUs (SIMT) |
| Clock | 3–6 GHz | 1–2.5 GHz |
| Pipelines | Deep, out-of-order | Simpler, in-order per lane |
| Branch prediction | Sophisticated | Minimal — divergence is costly |
| Cache per core | 32 KB L1 + MBs of L2/L3 | Tens to hundreds of KB of shared memory + L1 per SM |
| Memory bandwidth | 50–500 GB/s | 500–3 000 GB/s (HBM) |
| Latency | Low (~1 ns L1) | Hidden by massive parallelism |
| Thread model | Few, heavy threads | Warps/waves of 32–64 threads |
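
The "hidden by massive parallelism" entry can be made concrete with Little's law: bytes in flight = bandwidth × latency. The sketch below is a back-of-envelope estimate only. The latency figures, the 64-byte request size, and the function name `requests_in_flight` are assumptions chosen for illustration, not measurements of any particular chip.

```python
# Little's law applied to memory traffic: to sustain a given bandwidth,
# bandwidth * latency bytes must be in flight at any moment.
# All numeric inputs below are illustrative assumptions.

def requests_in_flight(bandwidth_gb_s: float, latency_ns: float,
                       request_bytes: int = 64) -> float:
    """Outstanding memory requests needed to sustain the given bandwidth."""
    # GB/s * ns = (1e9 bytes/s) * (1e-9 s) = bytes
    bytes_in_flight = bandwidth_gb_s * latency_ns
    return bytes_in_flight / request_bytes

# Assumed CPU DRAM figures: 100 GB/s at 100 ns.
cpu = requests_in_flight(bandwidth_gb_s=100, latency_ns=100)
# Assumed GPU HBM figures: 2 000 GB/s at 500 ns.
gpu = requests_in_flight(bandwidth_gb_s=2000, latency_ns=500)

print(f"CPU: ~{cpu:.0f} requests in flight")
print(f"GPU: ~{gpu:.0f} requests in flight")
```

Under these assumed numbers the GPU needs about 100 times more outstanding requests than the CPU, which is why it keeps thousands of threads resident instead of relying on deep caches and out-of-order execution.
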

When each wins

| Workload | Better |
|---|---|
| Serial / branchy code | CPU |
| OS, databases, compilers | CPU |
| Matrix multiplies / neural nets | GPU |
| Graphics, ray tracing | GPU |
| Scientific sims (lattice methods) | GPU |
| Small payloads, low latency | CPU |
| Large payloads, throughput | GPU |
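
The payload-size rows can be illustrated with a toy cost model: the GPU pays a fixed offload overhead plus a transfer cost, then processes data much faster. Every constant below (processing rates, link bandwidth, launch overhead) is an assumption picked for illustration, and the function names are hypothetical. Real break-even points depend heavily on the workload and hardware.

```python
# Toy CPU-vs-GPU cost model. Every constant is an illustrative assumption.
CPU_GB_S = 10.0            # assumed effective CPU processing rate
GPU_GB_S = 500.0           # assumed effective GPU processing rate
LINK_GB_S = 25.0           # assumed host-to-device link bandwidth
LAUNCH_OVERHEAD_S = 10e-6  # assumed fixed cost per GPU offload

def cpu_time_s(n_bytes: float) -> float:
    """Time to process n_bytes on the CPU, with no offload cost."""
    return n_bytes / (CPU_GB_S * 1e9)

def gpu_time_s(n_bytes: float) -> float:
    """Time to offload and process n_bytes on the GPU."""
    transfer = n_bytes / (LINK_GB_S * 1e9)
    compute = n_bytes / (GPU_GB_S * 1e9)
    return LAUNCH_OVERHEAD_S + transfer + compute

def break_even_bytes():
    """Payload size at which both paths take equal time, or None if the GPU never wins."""
    cpu_per_byte = 1.0 / (CPU_GB_S * 1e9)
    gpu_per_byte = 1.0 / (LINK_GB_S * 1e9) + 1.0 / (GPU_GB_S * 1e9)
    if gpu_per_byte >= cpu_per_byte:
        return None
    return LAUNCH_OVERHEAD_S / (cpu_per_byte - gpu_per_byte)

for size in (4 * 1024, 1024 * 1024, 256 * 1024 * 1024):
    winner = "CPU" if cpu_time_s(size) < gpu_time_s(size) else "GPU"
    print(f"{size:>12} bytes -> {winner}")
```

With these assumed constants the crossover sits at roughly 170 KB: smaller payloads finish sooner on the CPU because the fixed overhead dominates, while larger ones favor the GPU.
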

Specialized accelerators

| Accelerator | Description |
|---|---|
| TPU (Google) | Tensor Processing Unit: dense matmul, training/inference |
| Trainium / Inferentia (AWS) | Cloud ML accelerators |
| NPU | Neural Processing Unit: on-device ML (Apple, Qualcomm) |
| FPGA | Reconfigurable hardware for custom pipelines |
| DPU | Data Processing Unit: offload networking/storage |