giuseppebilotta,

OK so I'm ready for today's #GPGPU lesson with the new laptop. My only gripe for the lesson will be that #Rusticl in #Mesa 23.2 doesn't support #profiling information. Apparently the feature was only merged in a later commit
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24101
and I even tried upgrading to my distro's experimental 23.3-rc1 packages, but trying to use rusticl on those packages segfaults. So either I've messed up something with this mixed upgrade, or I've hit an actual bug.
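
For reference, "profiling information" here is the standard OpenCL mechanism: a command queue created with CL_QUEUE_PROFILING_ENABLE, plus clGetEventProfilingInfo on the resulting events. A minimal sketch (error handling omitted, nothing Rusticl-specific):

/* queue with profiling enabled, then timestamps from an event */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);

    /* profiling must be requested at queue creation time */
    cl_queue_properties props[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, NULL);

    /* any enqueued command can hand back an event; a buffer fill will do */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 1 << 20, NULL, NULL);
    cl_float zero = 0.0f;
    cl_event ev;
    clEnqueueFillBuffer(q, buf, &zero, sizeof zero, 0, 1 << 20, 0, NULL, &ev);
    clFinish(q);

    /* this is the query that needs driver-side profiling support */
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof start, &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof end, &end, NULL);
    printf("fill took %llu ns\n", (unsigned long long)(end - start));

    clReleaseEvent(ev); clReleaseMemObject(buf);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}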

giuseppebilotta,

Luckily I could trivially roll back to the distro's 23.2 packages, which may not have profiling wired up properly, but at least they work. I guess I'll have to wait for next year to show my students how much better the FLOSS drivers are compared to AMD's ;-)

giuseppebilotta,

I'm still moderately annoyed by the fact that there's no single platform to drive all the compute devices on this machine. #PoCL comes close because it supports both the CPU and the dGPU through #CUDA, but not the iGPU (there's an #HSA device, but it doesn't support my iGPU). #Rusticl supports the iGP (radeonsi) and the CPU (llvmpipe), but not the dGPU (partly because I'm running that on the proprietary drivers for CUDA). Everything else has at best one supported device out of the three available.
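
(For context, the coverage above is just what a bare-bones clinfo-style enumeration shows: every installed platform and the devices it exposes. A sketch, error handling omitted:)

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_uint nplat = 0;
    clGetPlatformIDs(0, NULL, &nplat);
    if (nplat > 16) nplat = 16;
    cl_platform_id plats[16];
    clGetPlatformIDs(nplat, plats, NULL);

    for (cl_uint p = 0; p < nplat; ++p) {
        char pname[256];
        clGetPlatformInfo(plats[p], CL_PLATFORM_NAME, sizeof pname, pname, NULL);

        cl_uint ndev = 0;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 0, NULL, &ndev);
        if (ndev > 16) ndev = 16;
        printf("%s: %u device(s)\n", pname, ndev);

        cl_device_id devs[16];
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, ndev, devs, NULL);
        for (cl_uint d = 0; d < ndev; ++d) {
            char dname[256];
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof dname, dname, NULL);
            printf("  - %s\n", dname);
        }
    }
    return 0;
}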

giuseppebilotta,

As things stand, my best chance at getting all three devices in one platform would be to add AMD support to PoCL via a HIP/ROCm driver that mimics the existing CUDA one. Honestly, that's pretty sad.

(Arguably, Rusticl with nouveau might soon be an option, too, but having to switch between nouveau and the proprietary driver is a PITN. It would be so much better if NVIDIA supported their compute stack on top of the FLOSS stuff, like AMD does.)

giuseppebilotta,

So I actually tried to give the #HSA #PoCL driver a go, and while I didn't get support for my #AMD integrated #GPU working (it should be doable, though), I did discover something interesting that I hadn't noticed before, since I had never compared the clinfo output for my iGP between #Rusticl and the proprietary driver.

So here's the weird thing: they report a different number of compute units!
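
(The number in question is simply CL_DEVICE_MAX_COMPUTE_UNITS; a minimal sketch of the query, assuming the iGP happens to be the first GPU device of the first platform:)

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    char name[256];
    cl_uint cus = 0;
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof name, name, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof cus, &cus, NULL);

    /* same device, different answer depending on the driver (see below) */
    printf("%s: %u compute units\n", name, cus);
    return 0;
}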

giuseppebilotta,

Until recently, the number of CUs in a GPU (at least on NVIDIA and AMD devices) was very well defined: each “multiprocessor” (the equivalent of a CPU core) was a compute unit. Apparently, things have changed on recent AMD GPUs: starting from #RDNA devices, CUs are grouped into “Workgroup Processors”. Interestingly, the two CUs in a WGP can act either as separate entities or as a single unit. Where this becomes particularly noticeable is in kernels that make use of the LDS (the Local Data Share, i.e. the on-chip scratchpad that backs OpenCL local memory).
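
To make the LDS point concrete, here's a generic sketch of the kind of kernel I have in mind: an OpenCL C work-group reduction that stages data in local memory, which AMD backs with the LDS. Whether a work-group's local memory lives on a single CU or on the whole WGP is exactly what the CU/WGP execution mode decides. (The sketch assumes a power-of-two work-group size.)

__kernel void block_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch)   /* allocated in LDS */
{
    const size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* tree reduction within the work-group */
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}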

giuseppebilotta,

How did I come across all this? By going over the #HSA header files and noticing that the AMD extension to query device properties has recently introduced a new property called “cooperative compute unit count”, whose description reads:

> Some processors support more CUs than can reliably be used in a cooperative dispatch. This queries the count of CUs which are fully enabled for cooperative dispatch.
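
For anyone who wants to poke at the same property, the query goes through hsa_agent_get_info with the AMD agent attributes. The attribute names below are my reading of hsa_ext_amd.h, so double-check them against your ROCm headers:

#include <stdio.h>
#include <stdint.h>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

/* print the plain and "cooperative" CU counts for every GPU agent */
static hsa_status_t print_cus(hsa_agent_t agent, void *data) {
    (void)data;
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
    if (type != HSA_DEVICE_TYPE_GPU)
        return HSA_STATUS_SUCCESS;

    uint32_t cus = 0, coop_cus = 0;
    /* attribute names as found in hsa_ext_amd.h -- verify on your system */
    hsa_agent_get_info(agent,
        (hsa_agent_info_t)HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT, &cus);
    hsa_agent_get_info(agent,
        (hsa_agent_info_t)HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT, &coop_cus);
    printf("CUs: %u, cooperative CUs: %u\n", cus, coop_cus);
    return HSA_STATUS_SUCCESS;
}

int main(void) {
    hsa_init();
    hsa_iterate_agents(print_cus, NULL);
    hsa_shut_down();
    return 0;
}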

giuseppebilotta,

I'm honestly a bit perplexed about the choice of wording there. That “reliably” there seems to suggest that CU mode would be … more error prone (?) than using WGP mode, in contrast to the technical documentation that only mentions performance-related aspects of the two modes …

One thing that I'm curious about is whether the #Rusticl report of 12 CUs vs 6 WGPs is a conscious choice or just something inherited from previous architectures … @karolherbst any idea on this? or who should I ask?

karolherbst,

@giuseppebilotta Rusticl just uses whatever the driver advertises itself. In this case whatever radeonsi says. Radeonsi returns sscreen->info.num_cu for the compute units.

giuseppebilotta,

@karolherbst thanks for the information, so this is up to the Mesa driver to decide? Or is it even higher, at the kernel level?

karolherbst,

@giuseppebilotta yes, it's up to the mesa driver.

ProjectPhysX,

@giuseppebilotta yes, they report dual-CUs instead of CUs for some reason. Estimating the TFLOPS of hardware based on reported CUs and clock frequency already required a table of device name fragments before this, since cores/CU can be 0.5, 1, 8, 16, 64, 128, 192, or 256.
https://github.com/ProjectPhysX/OpenCL-Wrapper/blob/master/src/opencl.hpp#L56
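
For readers following along, the estimate boils down to something like this; the cores/CU factor is the part that needs the per-device table, since the runtime doesn't report it (the numbers in the example are purely illustrative):

#include <stdio.h>

/* peak FP32 throughput from the OpenCL-reported CU count and clock (MHz),
 * plus a guessed cores-per-reported-CU factor; assumes 1 FMA = 2 FLOPs/cycle */
static double peak_tflops(unsigned compute_units, unsigned clock_mhz,
                          double cores_per_cu)
{
    return compute_units * cores_per_cu * 2.0 * clock_mhz * 1e6 / 1e12;
}

int main(void) {
    /* e.g. an RDNA iGP reported as 6 "CUs" (really WGPs), 128 FP32 lanes per
     * WGP, 1900 MHz -- illustrative numbers, not a real device's specs */
    printf("~%.2f TFLOPS\n", peak_tflops(6, 1900, 128));
    return 0;
}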

giuseppebilotta,

@ProjectPhysX
to me it's a bit frustrating because until now I had a pretty good mental model of what a CU is, corresponding to a physical core on the CPU and a multiprocessor (MP) on the GPU: this created some pretty decent analogies between the different architectures, with SMT on the CPU mapping to the MP's capacity to run multiple independent subgroups, SIMD lanes mapping to SIMT lanes (with the due differences), etc. Now things start to get fuzzier …

giuseppebilotta,

@ProjectPhysX BTW aside from me disliking your choice to use the term “core” in PhysX (sorry!), I would expect the choice to present WGPs as OpenCL CUs would actually make things easier in your case? AFAICS this was done mainly to preserve backwards-compatibility with some implicit expectations about their CU structures (such as them having 4 SIMT units per CU).

giuseppebilotta,

@ProjectPhysX I'm also curious why you don't use vendor-specific device queries to get that extra information: both AMD and NVIDIA expose most of the information you need without any need for pattern matching on the device name.
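
Something like this is what I have in mind; the tokens come from the cl_amd_device_attribute_query and cl_nv_device_attribute_query extensions (they're declared in CL/cl_ext.h in recent Khronos headers), and of course the extension string should be checked before using them:

#include <stdio.h>
#include <string.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* crude check against the device's extension string (fixed-size buffer) */
static int has_ext(cl_device_id dev, const char *ext) {
    char exts[8192] = "";
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof exts, exts, NULL);
    return strstr(exts, ext) != NULL;
}

int main(void) {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    if (has_ext(dev, "cl_amd_device_attribute_query")) {
        cl_uint simd_per_cu = 0, wave = 0;
        clGetDeviceInfo(dev, CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD,
                        sizeof simd_per_cu, &simd_per_cu, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_WAVEFRONT_WIDTH_AMD,
                        sizeof wave, &wave, NULL);
        printf("AMD: %u SIMDs per CU, wavefront width %u\n", simd_per_cu, wave);
    }
    if (has_ext(dev, "cl_nv_device_attribute_query")) {
        cl_uint cc_major = 0, warp = 0;
        clGetDeviceInfo(dev, CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV,
                        sizeof cc_major, &cc_major, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_WARP_SIZE_NV, sizeof warp, &warp, NULL);
        printf("NVIDIA: compute capability %u.x, warp size %u\n", cc_major, warp);
    }
    return 0;
}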
