Reading the Profiler: From nn.Linear to a Fused MLP

The PyTorch profiler ships with every PyTorch install, yet most of the engineers who could open it never have. The cost of that gap is concrete: time spent tuning kernels, batch sizes, or model architectures for problems the profiler would have diagnosed in seconds. A new Hugging Face post, Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP by Aritra Roy Gosthipaty, Rémi Ouazan Reboul, Sergio Paniego, Pedro Cuenca, and Sayak Paul, treats profiling as a skill you build, not a black box you stare at. Its real argument is that the bottleneck in most PyTorch code is scheduling, not math.

Part 1 of the series set the stage with the smallest possible example: torch.add(torch.matmul(x, w), b). To a reader that pair looks like one fused operation, but on a GPU it dispatches as two separate kernels, each launched from the CPU, with the intermediate result written to memory in between. The lesson was that the gap between what you wrote and what your GPU ran is the whole story, and that the way to close it is to read the trace.
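To make that gap visible, here is a minimal sketch of profiling the matmul-add pair with torch.profiler. The tensor shapes are illustrative assumptions rather than values from the post, and the script falls back to CPU when no GPU is present, in which case only operator-level events appear.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Assumption: shapes are illustrative, not taken from the post.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 128, device=device)
w = torch.randn(128, 256, device=device)
b = torch.randn(256, device=device)

# Record CPU-side operator dispatch, plus GPU kernels when a GPU exists.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    y = torch.add(torch.matmul(x, w), b)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before the profiler stops

# Each distinct operator (and, on GPU, each kernel) gets its own row.
op_names = {evt.key for evt in prof.key_averages()}
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

In the printed table, the matmul and the add show up as separate rows, which is the point: one line of Python, two dispatched operations.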

Part 2 starts where Part 1 ended, swapping the hand-written matmul-add for nn.Linear(bias=True) and stacking three of those layers with an inter-layer activation to form an MLP block. From there, the post applies the trace-reading skills Part 1 introduced, but on a shape that resembles a real model. Three companion scripts (02_linear.py, 03_simple_mlp.py, and 03_kernels_mlp.py, linked from the post) let a reader reproduce the arc on an NVIDIA A100-SXM4-80GB in a few minutes via Hugging Face Spaces Dev Mode or Jobs.
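A rough sketch of that setup, assuming hypothetical layer sizes and a GELU activation (the companion scripts may use different choices), could look like this. The record_function label is there only to make the forward pass easy to find in the trace.

```python
import torch
from torch import nn
from torch.profiler import profile, record_function, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical sizes and activation; the post's scripts may differ.
mlp = nn.Sequential(
    nn.Linear(256, 512, bias=True),
    nn.GELU(),
    nn.Linear(512, 512, bias=True),
    nn.GELU(),
    nn.Linear(512, 256, bias=True),
).to(device)
x = torch.randn(32, 256, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    mlp(x)  # warm-up so one-time initialization stays out of the trace
    with profile(activities=activities) as prof:
        with record_function("mlp_forward"):
            out = mlp(x)
        if device == "cuda":
            torch.cuda.synchronize()

# Map each event name to how many times it was recorded.
counts = {evt.key: evt.count for evt in prof.key_averages()}
linear_calls = counts.get("aten::linear", 0)
print(f"aten::linear calls: {linear_calls}")
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=12))
```

Three Linear layers produce three aten::linear calls, each with its own dispatch cost, plus the activation calls in between.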

Two regimes organize the whole piece. The first is overhead-bound, where CPU dispatch and kernel launch overhead dominate, and where making the kernels themselves faster has diminishing returns. The second is compute-bound, where the actual arithmetic dominates, and where fusion helps only insofar as it cuts redundant memory traffic between kernels. Small models and small batches tend to be overhead-bound; large production workloads tend to be compute-bound. The win from fusion is conditional on which regime you are in, and the only way to know is to read the trace.
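One rough way to feel the difference between the two regimes, before opening a trace, is to time the same operation at growing sizes. The sketch below is a heuristic under assumed sizes, not a method from the post: if the time per call barely moves while the FLOP count grows by orders of magnitude, fixed overhead is dominating; once time starts tracking FLOPs, the arithmetic is.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def time_matmul(n, iters=20):
    """Average wall-clock seconds per n x n matmul on the chosen device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    for _ in range(3):  # warm-up calls, excluded from timing
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # include queued GPU work in the measurement
    return (time.perf_counter() - start) / iters

# Assumption: these sizes are arbitrary probes, chosen to span both regimes.
results = {}
for n in (16, 64, 256, 1024):
    seconds = time_matmul(n)
    flops = 2 * n**3  # multiply-adds in a dense n x n matmul
    results[n] = (seconds, flops / seconds)
    print(f"n={n:5d}  {seconds * 1e6:10.1f} us/call  {flops / seconds / 1e9:10.2f} GFLOP/s")
```

Wall-clock timing like this only hints at the regime. The profiler trace is what shows where the time actually goes.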

That is why the post's pacing matters. It does not jump to a fused implementation. It walks from one nn.Linear, to a stacked MLP, to a fused MLP, with the profiler output at each step. The reader watches the dispatch overhead shrink as operations collapse into single kernels, and they learn to recognize the same pattern in their own code. The constructive payoff is that profiling becomes a habit: a way of asking what is actually executing on the GPU, in what order, before reaching for a kernel library.
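The smallest instance of the fusion idea can be sketched with torch.addmm, which computes the bias add and the matmul as a single ATen operator. This is a stand-in chosen for illustration, not the fused MLP from the post; shapes are assumptions, and how many GPU kernels addmm maps to depends on the backend and the shapes.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

# Assumption: illustrative shapes, not taken from the post.
x = torch.randn(64, 128, device=device)
w = torch.randn(128, 256, device=device)
b = torch.randn(256, device=device)

def profiled_ops(fn):
    """Run fn under the profiler; return its result and the operator names seen."""
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        result = fn()
        if device == "cuda":
            torch.cuda.synchronize()
    return result, {evt.key for evt in prof.key_averages()}

# Two dispatched operators: matmul, then add.
unfused, unfused_ops = profiled_ops(lambda: torch.add(torch.matmul(x, w), b))
# One dispatched operator: b + x @ w.
fused, fused_ops = profiled_ops(lambda: torch.addmm(b, x, w))

print("unfused:", sorted(op for op in unfused_ops if op.startswith("aten::")))
print("fused:  ", sorted(op for op in fused_ops if op.startswith("aten::")))
print("results match:", torch.allclose(unfused, fused, rtol=1e-4, atol=1e-3))
```

The two versions compute the same values. What changes is how many operators the CPU has to dispatch, which is the quantity that matters in the overhead-bound regime.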

The honest caveat is hardware. The benchmarks in the Hugging Face post are run on a single card, the A100-SXM4-80GB, and the relative cost of dispatch overhead versus arithmetic shifts across hardware generations and across model scales. A reader applying the same recipes to an H100, or to a smaller card, will see different numbers. The mental model travels; the percentages do not.

What to watch next is whether profiling literacy becomes a first-class part of model training guides rather than a debugging afterthought. Posts like this one, and the runnable artifacts shipped alongside it, suggest the answer is starting to shift, one nn.Linear at a time.
