tormeh 7 hours ago

I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors. Being good at using Nvidia chips sounds a lot like being an ABAP consultant or similar to me. I realize there's a lot of money to be made in the field right now, but IIUC historically this kind of thing has not been a great move.

  • raincole an hour ago

    Yeah, that's what I told myself a decade ago when I skipped the CUDA class in college.

  • deltaburnt 10 minutes ago

    This article is about JAX, a parallel computation library that's meant to abstract away vendor-specific details. Obviously if you want the most performance you need to know the specifics of your hardware, but learning at a high level how a GPU vs. a TPU works seems like useful knowledge regardless.
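
    A quick sketch of what that abstraction looks like in practice (my example, not the article's): the same jitted function runs unchanged on whichever backend JAX discovers.

        import jax
        import jax.numpy as jnp

        # Backend discovery is automatic; the printout differs per machine,
        # e.g. CPU, CUDA GPU, ROCm GPU, or TPU devices.
        print(jax.devices())

        # The same jitted function compiles for whatever hardware is present.
        f = jax.jit(lambda x: jnp.sin(x) @ x.T)
        print(f(jnp.ones((8, 8))).shape)  # (8, 8)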

  • the__alchemist an hour ago

    The principles of parallel computing, and how they work at the hardware and driver levels, are broader than any one vendor. Some parts of it are provincial (a strong province, though...), and others are more general.

    It's hard to find skills that don't have a degree of provincialism. It's not a great feeling, but you move on. IMO, don't over-idealize the concept of general knowledge to your detriment.

    I think we can also untangle the open-source part from the general/provincial. There is more to the world worth exploring.

  • physicsguy 5 hours ago

    It really isn't that hard to pivot. It's worth saying that if you were already writing OpenMP and MPI code, then learning CUDA wasn't particularly difficult to get started with, and learning to write more performant CUDA code would also help you write faster CPU-bound code. It's an evolution of existing models of compute, not a revolution.

    • Q6T46nT668w6i3m an hour ago

      I agree that “learning CUDA wasn’t particularly difficult to get started,” but there are Grand Canyon-sized chasms between CUDA and its alternatives when you're attempting to crank performance.

  • hackrmn 3 hours ago

    I grew up learning programming on a genuine IBM PC running MS-DOS, neither of which was FOSS but taught me plenty that I routinely rely on today in one form or another.

  • saagarjha 6 hours ago

    Sure, but you can make money in the field and retire faster than it becomes irrelevant. FWIW none of the ideas here are novel or nontransferable–it's just the specific design that is proprietary. Understanding how to do an AllReduce has been of theoretical interest for decades and will probably remain worth doing far into the future.
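
    For what it's worth, here's a minimal sketch (mine, assuming a host whose accelerators are visible to JAX) of the AllReduce idea expressed as a collective; the concept is the same whether the wires underneath are NVLink, InfiniBand, or TPU ICI.

        import jax
        import jax.numpy as jnp

        # One value per local device; pmap maps the function over them.
        n = jax.local_device_count()
        x = jnp.arange(n, dtype=jnp.float32)

        # psum over the named axis is an AllReduce: every device ends up
        # holding the sum of all devices' contributions.
        all_reduce = jax.pmap(lambda v: jax.lax.psum(v, axis_name="d"),
                              axis_name="d")
        print(all_reduce(x))  # each entry equals 0 + 1 + ... + (n - 1)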

  • qwertox 5 hours ago

    It's a valid point of view, but I don't see the value in sharing it.

    There are enough people for whom it's worth it, even if just for tinkering, and I'm sure you are aware of that.

    It reads a bit like "You shouldn't use it because..."

    Learning about Nvidia GPUs will teach you a lot about other GPUs as well, and there are a lot of tutorials about the former, so why not use it if it interests you?

    • woooooo 4 hours ago

      It's a useful bit of caution to focus on transferable fundamentals; I remember when Oracle wizards were in high demand.

      • sigbottle 3 hours ago

        There are tons of ML compilers right now, FlashAttention brought the cache-aware model back to parallel programming, Moore's law has hit its limit, and heterogeneous hardware is taking off.

        Just some fundamentals I can think of off the top of my head. I'm surprised people are saying that the lower-level systems/hardware stuff is untransferable. These things are used everywhere. If anything, it's the AI itself that's potentially a bubble, but the fundamental need for understanding the performance & design of systems is always there.

        • woooooo 2 hours ago

          I'm actually doing a ton of research in the area myself; the caution was against becoming a narrow Nvidia expert rather than a general low-level programmer whose skills include Nvidia.

      • NikolaNovak 2 hours ago

        I mean, I'm in Toronto, Canada, a fairly big city and market, and have an open seat for a couple of good senior Oracle DBAs pretty much constantly. The market may have shrunk over the decades, but there's still more demand than supply. And the core DBA skills are transferable to other RDBMSs as well. While I agree that some niche technologies are fleeting, it's perhaps not the best example :-)

        • woooooo 2 hours ago

          That's actually interesting! My experience is different: compared to the late 90s and early 00s especially, most people avoid Oracle if they can. But yes, it's always worth having someone whose job is to think about the database if it's your linchpin.

          • kjellsbells an hour ago

            Well, there's the difference. Maybe demand has collapsed for the kind of people who knew how to tune the Oracle SGA and get their laughable CLI client to behave, but the market for people who structurally understood the best ways to organize, insert and pull data back out is still solid.

            Re Oracle and "big 90s names" specifically, there is a lot of it out there. Maybe it never shows up in the code interfaces HNers have to exercise in their day jobs, but the tech, for better or worse, is massively prevalent in the everyday world of transit systems and payroll and payment...ie all the unsexy parts of modern life.

  • Philpax 6 hours ago

    There's more in common with other GPU architectures than there are differences, so a CUDA consultant should be able to pivot if/when the other players are a going concern. It's more about the mindset than the specifics.

    • dotancohen 6 hours ago

      I've been hearing that for over a decade. I can't even name any CUDA competitors offhand, and none of them are likely to gain enough traction to upset CUDA in the coming decade.

      • Philpax 6 hours ago

        Hence the "if" :-)

        ROCm is getting some adoption, especially as some of the world's largest public supercomputers have AMD GPUs.

        Some of this is also being solved by working at a different abstraction layer; you can sometimes be ignorant of the hardware you're running on with PyTorch. It's still leaky, but it's something.
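
        As a rough sketch of that abstraction layer (my example; note that ROCm builds of PyTorch present their GPUs through the torch.cuda API):

            import torch

            # Pick whatever accelerator the local build exposes; a ROCm build
            # of PyTorch also reports its GPUs via torch.cuda.
            device = "cuda" if torch.cuda.is_available() else "cpu"

            x = torch.randn(1024, 1024, device=device)
            w = torch.randn(1024, 1024, device=device)
            y = x @ w  # dispatched to the backend's GEMM (cuBLAS, rocBLAS, or a CPU kernel)
            print(y.device)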

        • Q6T46nT668w6i3m 2 hours ago

          Look at the state of PyTorch’s CI pipelines and you’ll immediately see that ROCm is a nightmare. Especially nowadays when TPU and MPS, while missing features, rarely create cascading failures throughout the stack.

        • physicsguy 5 hours ago

          I still don't see ROCm as that serious a threat; they're still a long way behind in library support.

          I used to use rocFFT as an example: it was missing core functionality that cuFFT has had since like 2008. It looks like they've finally caught up now, but that's one library among many.

      • einpoklum 2 hours ago

        Talking about hardware rather than software: you have AMD and Intel. And if your platform is not x86_64, NVIDIA is probably not even one of the competitors; there you have ARM, Qualcomm, Apple, Samsung, and probably some others.

      • sdenton4 2 hours ago

        ...Well, the article compares GPUs to TPUs, made by a competitor you probably know the name of...

  • bee_rider an hour ago

    I think I'd rather get familiar with CuPy or JAX or something. BLAS/LAPACK wrappers will never go out of style. It is a subset of the sort of stuff you can do on a GPU, but it seems like a nice effort-to-functionality ratio.
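
    Something like the following (my sketch, not from the article) is roughly the level of effort involved; the same calls dispatch to whichever vendor BLAS/LAPACK-style libraries the backend provides.

        import jax
        import jax.numpy as jnp

        key = jax.random.PRNGKey(0)
        a = jax.random.normal(key, (1024, 1024), dtype=jnp.float32)
        b = jax.random.normal(key, (1024,), dtype=jnp.float32)

        c = a @ a.T                 # GEMM, the BLAS level-3 workhorse
        x = jnp.linalg.solve(a, b)  # LAPACK-style dense solve
        print(c.shape, x.shape)     # (1024, 1024) (1024,)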

  • moralestapia 2 hours ago

    It's money. You would do it for money.

  • WithinReason 6 hours ago

    What's in this article would apply to most other hardware, just with slightly different constants

  • rvz 2 hours ago

    > I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors.

    Better not learn CUDA then.

  • amelius 5 hours ago

    I mean it is similar to investing time in learning assembly language.

    For most IT folks it doesn't make much sense.

hackrmn 3 hours ago

I find the piece, much like a lot of other documentation, "imprecise". Like most such efforts, it likely caters to a group of people expected to benefit from being explained what a GPU is, but it fumbles its terms, e.g. (the first image with burned-in text):

> The "Warp Scheduler" is a SIMD vector unit like the TPU VPU with 32 lanes, called "CUDA Cores"

It's not clear from the above what a "CUDA core" (singular) _is_ -- this is the archetypal "let me explain things to you" error most people make, usually in good faith. If I don't know the material and am out to understand it, then you have gotten me to read all of it without making clear the very objects of your explanation.

And so, because of these kinds of "compounding errors", the people the piece was likely targeted at are none the wiser really, while those who already have a good grasp of the concepts being explained, like what a CUDA core actually is, already know most of what the piece is trying to explain anyway.

My advice to everyone who starts out with a back-of-envelope cheatsheet and then decides to publish it "for the good of mankind", e.g. on GitHub: please be surgically precise with your terms -- the terms are your trading cards, then come the verbs, etc. I mean, this is all writing 101, but it's a rare thing, evidently. Don't mix and match terms, don't conflate them (the reader will do it for you many times over for free if you're sloppy), and be diligent with analogies.

Evidently, the piece may have been written to help those already familiar with TPU terminology -- it mentions "MXU" but there's no telling what that is.

I understand this is a tall order, but the piece is long, and all the effort that was put in could have been complemented with minimal extra hypertext, like annotations for abbreviations such as "MXU".

I can always ask $AI to do the equivalent for me, which is a tragedy according to some.

nickysielicki 10 hours ago

The calculation under “Quiz 2: GPU nodes“ is incorrect, to the best of my knowledge. There aren’t enough ports for each GPU and/or for each switch (less the crossbar connections) to fully realize the 450GB/s that’s theoretically possible, which is why 3.2TB/s of internode bandwidth is what’s offered on all of the major cloud providers and the reference systems. If it was 3.6TB/s, this would produce internode bottlenecks in any distributed ring workload.

Shamelessly: I’m open to work if anyone is hiring.

  • aschleck 9 hours ago

    It's been a while since I thought about this, but isn't the reason providers advertise only 3.2 Tbps that it's the limit of a single node's connection to the IB network? The DGX is spec'ed to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs * 400 Gbps/GPU = 3.2 Tbps.

    Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
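
    Running the units (my arithmetic, using the figures quoted in this thread): the NIC numbers are gigabits per second while the NVLink numbers are gigabytes per second, which is where a lot of the confusion comes from.

        nic_gbit_per_gpu = 400                    # ConnectX-7, per GPU, in Gbit/s
        gpus_per_node = 8
        print(gpus_per_node * nic_gbit_per_gpu)   # 3200 Gbit/s = 3.2 Tbit/s per node

        nic_gbyte_per_gpu = nic_gbit_per_gpu / 8  # 50 GB/s per GPU to the IB fabric
        nvlink_gbyte_per_gpu = 450                # intranode NVLink bandwidth per GPU
        print(nvlink_gbyte_per_gpu / nic_gbyte_per_gpu)  # NVLink is ~9x the per-GPU NIC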

gregorygoc 9 hours ago

It’s mind-boggling that this resource has not been provided by NVIDIA yet. It’s reached the point that third parties reverse-engineer and summarize NV hardware to the point that it becomes an actually useful mental model.

What are the actual incentives at NVIDIA? If it’s all about marketing they’re doing great, but I have some doubts about engineering culture.

  • hackrmn 3 hours ago

    There's plenty of circumstantial evidence pointing to the fact that NVIDIA prefers to hand out semi-tailored documentation resources to signatories and other "VIPs", not least to exert control over who uses their products and how. I wouldn't put it past them to routinely neglect their _public_ documentation, for one reason or another that makes commercial sense to them but not to the public. As for incentives, go figure indeed -- you'd think that by walling off API documentation they're shooting themselves in the foot every day, but in these days of betting it all on AI, which means selling GPUs, software and those same NDA-gated VIP documentation articles to "partners", maybe they're all set anyway and care even less about the odd developer who wants to know how their flagship GPU works.

  • threeducks 7 hours ago

    With mediocre documentation, NVIDIA's closed-source libraries, such as cuBLAS and cuDNN, will remain the fastest way to perform certain tasks, thereby strengthening vendor lock-in. And of course it makes it more difficult for other companies to reverse-engineer.

  • KeplerBoy 41 minutes ago

    Nvidia has ridiculously good documentation for all of this compared to its competitors.

radarsat1 an hour ago

Why haven't Nvidia developed a TPU yet?

physicsguy 8 hours ago

It’s interesting that nvshmem has taken off in ML because the MPI equivalents were never that satisfactory in the simulation world.

Mind you, I did all long-range force stuff, which is difficult to work with over multiple nodes at the best of times.

einpoklum 2 hours ago

We should remember that these structural diagrams are _not_ necessarily what NVIDIA actually has as hardware. They carefully avoid guaranteeing that any of the entities or blocks you see in the diagrams actually _exist_. It is still just a mental model NVIDIA offers for us to think about their GPUs, and more specifically the SMs, rather than a simplified circuit layout.

For example, we don't know how many actual functional units an SM has; we don't know if the "tensor core" even _exists_ as a piece of hardware, or whether there's just some kind of orchestration of other functional units; and IIRC we don't know what exactly happens at the sub-warp level w.r.t. issuing and such.

aanet 12 hours ago

Fantastic resource! Thanks for posting it here.

akshaydatazip 8 hours ago

Thanks for the really thorough research on that. Just what I wanted for my morning coffee.

tucnak 7 hours ago

This post is a great illustration of why TPUs lend themselves more nicely to homogeneous computing: yes, there are systolic-array limitations (not good for sparsity), but all things considered, bandwidth doesn't change as your cluster grows ever larger. It's a shame Google is not interested in selling this hardware: if TPUs were available, it would open the door to compute-in-network capabilities far beyond what's currently possible, by combining non-homogeneous topologies involving various FPGA solutions, e.g. the Alveo V80 exposing 4x800G NICs.

Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.

  • namibj 2 hours ago

    Do TPUs yet allow a variable array dimension at a somewhat inner nesting level of the loop structure? Like, where you load expensive (bandwidth-heavy) data in from HBM, process a variable-length array with it, then stow away/accumulate into a fixed-size vector?

    Last I looked they would require the host to synthesize a suitable instruction stream for this on-the-fly with no existing tooling to do so efficiently.

    An example where this would be relevant is the LLM inference prefill stage with an (activated) MoE expert count on the order of the prompt length, or a small integer factor smaller than it, where you'd want to load only the needed experts and load each one at most once per layer.

porridgeraisin a day ago

A short addition: pre-Volta Nvidia GPUs were SIMD, like TPUs are, and not SIMT, which post-Volta Nvidia GPUs are.

  • camel-cdr a day ago

    SIMT is just a programming model for SIMD.

    Modern GPUs are still just SIMD with good predication support at the ISA level.

    • achierius 8 hours ago

      That's not true. SIMT notably allows for divergence and reconvergence, whereby single threads actually end up executing different work for a time, while in SIMD you have to always be in sync.

      • camel-cdr 7 hours ago

        I'm not aware of any GPU that implements this.

        Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].

        Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.

            if (threadIdx.x < 4) {
                A;
                B;
            } else {
                X;
                Y;
            }
            Z;
        
        The diagram shows how this executes in the following order:

        Volta:

            ->|   ->X   ->Y   ->Z|->
            ->|->A   ->B   ->Z   |->
        
        pre Volta:

            ->|      ->X->Y|->Z
            ->|->A->B      |->Z
        
        The SIMD equivalent of pre-Volta is:

            vslt mask, vid, 4
            vopA ..., mask
            vopB ..., mask
            vopX ..., ~mask
            vopY ..., ~mask
            vopZ ...
        
        The Volta model is:

            vslt mask, vid, 4
            vopA ..., mask
            vopX ..., ~mask
            vopB ..., mask
            vopY ..., ~mask
            vopZ ...
        
        
        [1] https://chipsandcheese.com/i/138977322/shader-execution-reor...

        [2] https://stackoverflow.com/questions/70987051/independent-thr...

        • namibj 3 hours ago

          IIUC Volta brought the ability to run a tail-call state machine (let's presume identically expensive states and a state count less than threads-per-warp) at an average goodput of more than one thread actually active.

          Before, it would lose all parallelism, as it couldn't handle different threads having truly different/separate control flow, instead emulating it in dumb mode via predicated execution/lane-masking.

      • adrian_b 7 hours ago

        "Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.

        "Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").

        SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.

        What seems to have been added since Volta is some mechanism for fast saving and restoring separate program counters for each CUDA "thread", in order to be able to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order, but those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot execute simultaneously instructions from different CUDA "threads", unless they perform the same operation, which is the same constraint that exists in any SIMD processor.

        Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".

        What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However dependencies between the CUDA "threads" of a "warp" shall be avoided whenever possible, because they reduce the achievable performance.
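
        A toy NumPy illustration of that masking (mine, not the parent's): both sides of a "divergent" branch are computed for every lane, and a per-lane mask selects which result each lane keeps.

            import numpy as np

            lane = np.arange(32)            # the 32 "CUDA threads" of one warp
            x = np.linspace(0.0, 1.0, 32)

            mask = lane < 4                 # if (threadIdx.x < 4)
            then_branch = x * 2.0           # executed for all lanes
            else_branch = x + 10.0          # also executed for all lanes
            result = np.where(mask, then_branch, else_branch)  # per-lane select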

    • porridgeraisin a day ago

      I was referring to this portion of TFA

      > CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.

      • adrian_b 7 hours ago

        This flexibility of CUDA is a software facility, which is independent of the hardware implementation.

        For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
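
        A rough Python analogue of that idea (my sketch, using jax.vmap as the vectorizing transform, not ispc itself): the kernel is written once per "thread", and the transform turns the branch into lane-masked SIMD-style operations.

            import jax
            import jax.numpy as jnp

            def per_thread(tid, x):
                # Scalar, SIMT-style source; the branch becomes a lane select.
                return jnp.where(tid < 4, x * 2.0, x + 10.0)

            lanes = jnp.arange(32)
            xs = jnp.linspace(0.0, 1.0, 32)
            print(jax.vmap(per_thread)(lanes, xs))  # all 32 "threads" at once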

tomhow 8 hours ago

Discussion of original series:

How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)

  • radarsat1 2 hours ago

    A comment from there:

    > There are plans to release a PDF version; need to fix some formatting issues + convert the animated diagrams into static images.

    I don't see anything on the page about it, has there been an update on this? I'd love to put this on my e-reader.