giuseppebilotta

@giuseppebilotta@fediscience.org

Researcher @ INGV-OE. Opinions my own.


giuseppebilotta, to GraphicsProgramming

It's official now: I hate #NUMA and variable frequency.

This is re: https://fediscience.org/@giuseppebilotta/111818775682992930

It might just be that I'm more proficient at analyzing and working around #GPU quirks (happens, when you do mostly #GPGPU for more than a decade) than #CPU ones, but there are so many weird things happening on this machine that I don't know where to start.

giuseppebilotta,

Just to mention one: why is it that the performance per core when using #OpenMP drops by 40% when switching from 1 to 2 threads, but only when using OMP_PROC_BIND=close and not when using OMP_PROC_BIND=spread? If anything I'd expect the reverse.

And then adding more threads gives me almost perfect scaling, at least up to 16 threads, before dropping again … WTH is happening here? Honestly wouldn't mind some #fediHelp with suggestions on what to look at/for … #HPC #askFedi do your magic!
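
For reference, a minimal probe along these lines (a sketch, not our actual benchmark code) is enough to see where the threads land under the two bindings; sched_getcpu() is Linux/glibc-specific, and the program gets run once with OMP_PROC_BIND=close and once with OMP_PROC_BIND=spread for a given OMP_NUM_THREADS:

    // probe.cpp — build with: g++ -O2 -fopenmp probe.cpp -o probe
    // run e.g.: OMP_NUM_THREADS=2 OMP_PROC_BIND=close ./probe
    #include <omp.h>
    #include <sched.h>   // sched_getcpu(), Linux/glibc
    #include <cstdio>

    int main() {
        #pragma omp parallel
        {
            // With binding enabled each thread should stay on the same CPU
            // for the whole run, so a single sample is representative.
            const int tid = omp_get_thread_num();
            const int cpu = sched_getcpu();
            #pragma omp critical
            std::printf("thread %2d -> logical CPU %3d\n", tid, cpu);
        }
        return 0;
    }

If close puts thread 1 on the SMT sibling of thread 0's core while spread puts it on a different core (or the other socket), that alone would explain the asymmetry above.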

giuseppebilotta,

@jannem hyperthreading is enabled. Your remark is definitely on point. The obvious question then is: is there a way to ask OpenMP which cores the threads get bound to and to prefer the same socket but not core siblings? Do I have to play with OMP_PLACES manually?
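
What I have in mind, in code terms, is roughly this — a sketch assuming an OpenMP ≥ 4.5 runtime, where omp_get_place_num() and omp_get_place_proc_ids() report the place each thread is bound to, and where (as far as I understand) OMP_PLACES=cores is the standard way to make close binding step by physical core rather than by hardware thread; newer runtimes also have OMP_DISPLAY_AFFINITY=true to just print the binding at startup:

    // places.cpp — build with: g++ -O2 -fopenmp places.cpp -o places
    // run e.g.: OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close ./places
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        #pragma omp parallel
        {
            const int place = omp_get_place_num();   // -1 if the thread is unbound
            #pragma omp critical
            {
                std::printf("thread %2d -> place %2d:", omp_get_thread_num(), place);
                if (place >= 0) {
                    std::vector<int> ids(omp_get_place_num_procs(place));
                    omp_get_place_proc_ids(place, ids.data());   // logical CPUs in the place
                    for (int id : ids) std::printf(" %d", id);
                }
                std::printf("\n");
            }
        }
        return 0;
    }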

giuseppebilotta,

@jannem our code is still sufficiently memory-bound that it can benefit from SMT after running out of physical cores, although the benefit of the extra hardware threads isn't as impressive. I've done some additional tests with OpenMP verbose output and the issue is definitely that it goes for the next hardware thread (SMT sibling) rather than the next physical core. So now I have an additional dimension (OMP_PLACES) to play with in my tests. Which is fine, except I wish I had known before. Hate having to rerun all tests.

giuseppebilotta,

@rmsilva @jannem thanks for the hints. I've started playing around with the settings and it might turn out that for best performance I might have to manually set OMP_PLACES to the sequence I want. That doesn't bring me to peace with NUMA+OpenMP, but at least gives me some breathing space to cope ;-)
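
If it does come to that, the explicit list is just a string of {cpu} entries, so a throwaway helper along these lines (hypothetical, not GPUSPH code) can generate it from whatever core ordering I pick — with the actual CPU numbers taken from the machine's own topology (lscpu/hwloc), not from this sketch:

    // make_places.cpp — build with: g++ -O2 make_places.cpp -o make_places
    // then e.g.: export OMP_PLACES="$(./make_places)" OMP_PROC_BIND=close
    #include <cstdio>
    #include <string>
    #include <vector>

    // Build an explicit OMP_PLACES value such as "{0},{1},{2},..." from a
    // chosen ordering of logical CPU numbers.
    std::string make_places(const std::vector<int>& cpus) {
        std::string places;
        for (int cpu : cpus) {
            if (!places.empty()) places += ",";
            places += "{" + std::to_string(cpu) + "}";
        }
        return places;
    }

    int main() {
        // Placeholder ordering: the first 16 logical CPUs, assumed (for this
        // sketch only) to be 16 distinct physical cores on one socket.
        std::vector<int> order;
        for (int cpu = 0; cpu < 16; ++cpu) order.push_back(cpu);
        std::printf("%s\n", make_places(order).c_str());
        return 0;
    }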

giuseppebilotta,

@reuterbal @rmsilva @jannem
good point. There are parts of the algorithm that are definitely memory-bound, but the main part shouldn't be. I'll have to double-check this after making sure I run the test I think I'm running, rather than something split across two sockets + SMT. Still, it might be worth looking into how much bandwidth I have available and how it's being used. Recommendations on a quick tool to assess both close and far memory bandwidth? (This is an AMD EPYC, in case it matters.)
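
In the meantime, a rough triad-style sketch (nowhere near as careful as STREAM or likwid-bench, which I suppose are the standard answers) at least gives a ballpark for close vs far memory: pin it to one socket and switch the memory node with numactl, e.g. numactl --cpunodebind=0 --membind=0 ./bw for local memory and numactl --cpunodebind=0 --membind=1 ./bw for remote.

    // bw.cpp — build with: g++ -O2 -fopenmp bw.cpp -o bw
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t N = size_t(1) << 27;            // 128M doubles ≈ 1 GiB per array
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
        const double s = 3.0;

        double best = 0.0;
        for (int rep = 0; rep < 10; ++rep) {
            const double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (size_t i = 0; i < N; ++i)
                a[i] = b[i] + s * c[i];              // STREAM-triad-like kernel
            const double dt = omp_get_wtime() - t0;
            const double gbs = 3.0 * N * sizeof(double) / dt / 1e9;  // read b,c + write a
            if (gbs > best) best = gbs;
        }
        std::printf("best triad bandwidth: %.1f GB/s\n", best);
        return 0;
    }

(Without numactl the placement follows first-touch, and the arrays here are initialized by a single thread, so the numbers would mostly reflect one node anyway.)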

giuseppebilotta,

@hattom @jannem
thanks for the hint. Luckily this is OpenMP on bare metal so I have as much control as I want. I hadn't thought about using LIKWID for pinning too (I only used it to measure the CPU frequency during the program runs). At the moment I'm playing around with OMP_PLACES, which in combination with OMP_PROC_BIND has some unexpected behavior at times, so making sure I test what I really want to test is … less trivial than I expected.

giuseppebilotta,

@hattom @jannem

at the moment I'm more interested in producing reliable benchmark results for an article I'm writing; optimizations will come later. But at least for the benchmarks I have to make sure I'm measuring what I think I'm measuring 8-) At least I think I've found the right places+bind incantation to make sure the threads go on the cores I expect them to end up on … hopefully this will give less unpredictable results.

giuseppebilotta,

Biggest thanks to all who replied, the #Fediverse never ceases to amaze me with its friendliness and helpfulness. I've now started running the new batch of tests. I'll be posting updates for the curious. Let's see if we can scale to at least 32 threads without issues, maybe push it to 64. I expect things to go really sour the moment we cross the socket barrier.

giuseppebilotta,

@hattom it's a dual-socket AMD EPYC 7713, with 2 NUMA nodes, one per socket. I'm aware that with the CCX structure the situation in each socket is a bit more complicated than that, but I expect this to only have a lower-order impact on the whole thing.
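
For completeness, a minimal libnuma sketch (not GPUSPH code) to double-check the CPU→node mapping before trusting any OMP_PLACES list:

    // numa_map.cpp — build with: g++ -O2 numa_map.cpp -o numa_map -lnuma
    #include <numa.h>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "NUMA not available on this system\n");
            return 1;
        }
        const int ncpus = numa_num_configured_cpus();
        for (int cpu = 0; cpu < ncpus; ++cpu) {
            // numa_node_of_cpu() returns -1 for CPUs that are offline or unknown
            std::printf("cpu %3d -> node %d\n", cpu, numa_node_of_cpu(cpu));
        }
        return 0;
    }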

giuseppebilotta,

@hattom thanks. As I mentioned, at the moment I'm mostly interested in (properly measured) performance numbers. After this round of benchmarks is done and published, it'll be time to look into optimizations, and details like these will be very useful.

ProjectPhysX, to linux

Software should always "just work". To make compiling #FluidX3D easier, I made the compile script smarter: it now automatically detects the operating system (#Linux / #macOS / #Android), #X11 support on Linux, and whether GNU make is installed. 🖖🧐
https://github.com/ProjectPhysX/FluidX3D/commit/f990dfbe3f7a922d1cb6523e8e0b8e6d6cf8c905

giuseppebilotta,

@ProjectPhysX that's an interesting approach to the build process. I thought we were being “effective” by only requiring (GNU) make, but I see that you don't even require that.

(BTW, if you only did the uname check when the variable isn't already set by the user, you could let users override the target before running the script, e.g. with target=Linux ./make.sh, to avoid building X11 support even when it would otherwise be detected.)

giuseppebilotta, to lemmy

Does anybody know of communities/magazines centered around any of these topics:

[topic links not preserved in this copy]

The closest I can find is @fluidmechanics

giuseppebilotta, to random

Of course the code performs very differently when compiled with GCC vs when compiled with Clang … Wanna guess I'm going to have to run these tests with both compilers? (On the one hand, more material for the manuscript; on the other, several more days of simulation …)

giuseppebilotta, to hpc

Kind of surprised there is no .hpc gTLD

#HPC #Internet

beatnikprof, to academia
@beatnikprof@mas.to

Student: How can I improve my participation score?
Me: Participate.
🤦‍♂️
#professor #academia #academicchatter #academicmastodon

giuseppebilotta,

@Zitzero @beatnikprof
it also kind of depends on the professor: not all the ones who claim to want more participation actually appreciate it, or not all in the same form. But e.g. if the students are struggling with the material (or the way it's presented) and nobody speaks up, how will the professor know? (Sometimes the vacant look in the students' eyes helps, but …)

giuseppebilotta, to random

Well, this is Not Good™: scaling one of the test cases up in size leads to random segfaults due to NaNs appearing out of nowhere during integration. And it seems to be hardware-related, since the exact same test case produces the expected result everywhere else.

If I have to RMA this machine I'm going to scream.

giuseppebilotta,

Debugging this NaN is interesting because I'm finding a lot of kinks that are easy to smooth out, but unrelated to the actual problem.
It's also frustrating because getting the data needed for the debugging takes FOREVER.
OTOH it's nice to leverage the debugging infrastructure I built in GPUSPH.

giuseppebilotta,

Oh wow I got it. There's an integer overflow somewhere.

giuseppebilotta,

This was a fascinating bug to track down. The key issue is that even if the number of particles is limited to UINT_MAX, the neighbors list needs larger-than-uint indexing. This isn't new, and was already taken into account in the code by having a specific 64-bit type used to index the neighbors list. However, in a couple of places in the guts of the code this was forgotten during the transition, and that's what was tripping in this case.
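
An illustrative sketch of the failure mode (names and numbers are mine, not the actual GPUSPH code): the product of two 32-bit quantities has to be widened before the multiplication, not after.

    // overflow.cpp — build with: g++ -O2 overflow.cpp -o overflow
    // (building with clang++ -fsanitize=integer makes the wrapping
    //  multiplication report at runtime, which is how the remaining
    //  offenders were found)
    #include <cstdint>
    #include <cstdio>

    int main() {
        const uint32_t num_particles = 20'000'000;  // fits comfortably in a uint
        const uint32_t max_neibs     = 320;         // the enlarged list used in these tests

        // Wrong: the product is computed in 32 bits and wraps modulo 2^32
        // (20e6 * 320 > 2^32) before the conversion to the 64-bit index type.
        const uint64_t bad  = uint64_t(num_particles * max_neibs);

        // Right: widen one operand first, so the multiplication itself is 64-bit.
        const uint64_t good = uint64_t(num_particles) * max_neibs;

        std::printf("wrapped: %llu, correct: %llu\n",
            (unsigned long long)bad, (unsigned long long)good);
        return 0;
    }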

giuseppebilotta,

So why didn't we spot it until now? After all, even with the standard neighbors list size of at most 128 elements per particle, it should trip whenever there are more than ~16M particles per device. The thing is, to actually trip the bug in a way that doesn't self-compensate you need a combination of specific geometries and a specific way of indexing the neighbors list.

giuseppebilotta,

To improve performance, the neighbors list is stored differently on the GPU, which uses an interleaved layout (first neighbor of all particles, followed by second neighbor of all particles, etc.), whereas the CPU uses the more classic layout of all neighbors of the first particle followed by all neighbors of the second particle, etc. (think row- vs column-major for 2D arrays).
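
In code terms, roughly this (names are mine, not GPUSPH's; in both cases the flat offset has to be computed in the 64-bit index type):

    // layouts.cpp — build with: g++ -O2 layouts.cpp -o layouts
    #include <cstdint>
    #include <cstdio>

    using neib_idx_t = uint64_t;   // dedicated wide type for neighbors-list indexing

    // GPU: interleaved ("column-major"): neighbor k of every particle is
    // contiguous, which keeps the accesses of adjacent threads coalesced.
    neib_idx_t gpu_index(uint32_t particle, uint32_t k, uint32_t num_particles) {
        return neib_idx_t(k) * num_particles + particle;
    }

    // CPU: linear ("row-major"): all neighbors of one particle are contiguous,
    // which is friendlier to per-core caches and prefetching.
    neib_idx_t cpu_index(uint32_t particle, uint32_t k, uint32_t max_neibs) {
        return neib_idx_t(particle) * max_neibs + k;
    }

    int main() {
        const uint32_t num_particles = 20'000'000, max_neibs = 320;
        // the same (particle, neighbor) entry sits at very different offsets
        std::printf("GPU offset: %llu\n", (unsigned long long)gpu_index(1234, 300, num_particles));
        std::printf("CPU offset: %llu\n", (unsigned long long)cpu_index(1234, 300, max_neibs));
        return 0;
    }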

giuseppebilotta,

This particular set of tests had the combination of all the elements needed to trip it: a large number of particles, a larger-than-normal neighbors list (320 neighbors per particle), and use of the CPU backend.

On the upside, the CPU backend usage was also what made debugging easier (for appropriate definitions thereof), especially once the reason for the bug was clear: building with clang -fsanitize=integer and finding the places where this caused issues.

giuseppebilotta,

Interestingly, there were a couple more places where the integer UB was triggered (esp. concerning the choice of signed vs unsigned). So I took the opportunity to iron those out too.

ProjectPhysX, to github German

#FluidX3D has passed 2000 stars! It is the most popular #CFD software on #GitHub now! 🖖😊⭐️
https://github.com/ProjectPhysX/FluidX3D
Feeling blessed that my work is useful to so many people across the globe, with users in 75 countries already! 🌍
42% EU, 30% Americas, 25% Asia, 3% Oceania+Africa

giuseppebilotta,

@ProjectPhysX that's almost a vertical line, congratulations! (How do you get the history of stars?)

giuseppebilotta,

@ProjectPhysX oh nice, thanks.
