giuseppebilotta

@giuseppebilotta@fediscience.org

Researcher @ INGV-OE. Opinions my own.


giuseppebilotta, to GraphicsProgramming

It's official now: I hate #NUMA and variable frequency.

This is re: https://fediscience.org/@giuseppebilotta/111818775682992930

It might just be that I'm more proficient at analyzing and working around #GPU quirks (happens, when you do mostly #GPGPU for more than a decade) than #CPU ones, but there are so many weird things happening on this machine that I don't know where to start.

giuseppebilotta,

Just to mention one: why is it that the performance per core when using #OpenMP drops by 40% when switching from 1 to 2 threads, but only when using OMP_PROC_BIND=close and not when using OMP_PROC_BIND=spread? If anything I'd expect the reverse.

And then adding more threads gives me almost perfect scaling, at least up to 16 threads, before dropping again … WTH is happening here? Honestly wouldn't mind some #fediHelp with suggestions on what to look at/for … #HPC #askFedi do your magic!
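
For reference, a minimal probe along these lines (a sketch, not our actual benchmark code) is enough to see where the threads land under the two bindings; sched_getcpu() is Linux/glibc-specific, and the program gets run once with OMP_PROC_BIND=close and once with OMP_PROC_BIND=spread for a given OMP_NUM_THREADS:

    // probe.cpp — build with: g++ -O2 -fopenmp probe.cpp -o probe
    // run e.g.: OMP_NUM_THREADS=2 OMP_PROC_BIND=close ./probe
    #include <omp.h>
    #include <sched.h>   // sched_getcpu(), Linux/glibc
    #include <cstdio>

    int main() {
        #pragma omp parallel
        {
            // With binding enabled each thread should stay on the same CPU
            // for the whole run, so a single sample is representative.
            const int tid = omp_get_thread_num();
            const int cpu = sched_getcpu();
            #pragma omp critical
            std::printf("thread %2d -> logical CPU %3d\n", tid, cpu);
        }
        return 0;
    }

If close puts thread 1 on the SMT sibling of thread 0's core while spread puts it on a different core (or the other socket), that alone would explain the asymmetry above.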

giuseppebilotta,

@jannem hyperthreading is enabled. Your remark is definitely on point. The obvious question then is: is there a way to ask OpenMP which cores the threads get bound to and to prefer the same socket but not core siblings? Do I have to play with OMP_PLACES manually?
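
What I have in mind, in code terms, is roughly this — a sketch assuming an OpenMP ≥ 4.5 runtime, where omp_get_place_num() and omp_get_place_proc_ids() report the place each thread is bound to, and where (as far as I understand) OMP_PLACES=cores is the standard way to make close binding step by physical core rather than by hardware thread; newer runtimes also have OMP_DISPLAY_AFFINITY=true to just print the binding at startup:

    // places.cpp — build with: g++ -O2 -fopenmp places.cpp -o places
    // run e.g.: OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close ./places
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        #pragma omp parallel
        {
            const int place = omp_get_place_num();   // -1 if the thread is unbound
            #pragma omp critical
            {
                std::printf("thread %2d -> place %2d:", omp_get_thread_num(), place);
                if (place >= 0) {
                    std::vector<int> ids(omp_get_place_num_procs(place));
                    omp_get_place_proc_ids(place, ids.data());   // logical CPUs in the place
                    for (int id : ids) std::printf(" %d", id);
                }
                std::printf("\n");
            }
        }
        return 0;
    }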

giuseppebilotta,

@jannem our code is still sufficiently memory-bound that it can benefit from SMT after running out of physical cores, although the benefit of the extra hardware threads isn't as impressive. I've done some additional tests with OpenMP verbose output and the issue is definitely that it goes for the next hardware thread (SMT sibling) rather than the next physical core. So now I have an additional dimension (OMP_PLACES) to play with in my tests. Which is fine, except I wish I had known before. Hate having to rerun all tests.

giuseppebilotta,

@rmsilva @jannem thanks for the hints. I've started playing around with the settings and it might turn out that for best performance I might have to manually set OMP_PLACES to the sequence I want. That doesn't bring me to peace with NUMA+OpenMP, but at least gives me some breathing space to cope ;-)
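
If it does come to that, the explicit list is just a string of {cpu} entries, so a throwaway helper along these lines (hypothetical, not GPUSPH code) can generate it from whatever core ordering I pick — with the actual CPU numbers taken from the machine's own topology (lscpu/hwloc), not from this sketch:

    // make_places.cpp — build with: g++ -O2 make_places.cpp -o make_places
    // then e.g.: export OMP_PLACES="$(./make_places)" OMP_PROC_BIND=close
    #include <cstdio>
    #include <string>
    #include <vector>

    // Build an explicit OMP_PLACES value such as "{0},{1},{2},..." from a
    // chosen ordering of logical CPU numbers.
    std::string make_places(const std::vector<int>& cpus) {
        std::string places;
        for (int cpu : cpus) {
            if (!places.empty()) places += ",";
            places += "{" + std::to_string(cpu) + "}";
        }
        return places;
    }

    int main() {
        // Placeholder ordering: the first 16 logical CPUs, assumed (for this
        // sketch only) to be 16 distinct physical cores on one socket.
        std::vector<int> order;
        for (int cpu = 0; cpu < 16; ++cpu) order.push_back(cpu);
        std::printf("%s\n", make_places(order).c_str());
        return 0;
    }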

giuseppebilotta,

@reuterbal @rmsilva @jannem
good point. There are parts of the algorithm that are definitely memory-bound, but the main part shouldn't be. I'll have to double-check this after making sure I run the test I think I'm running, rather than something split across two sockets + SMT. Still, it might be worth looking into how much bandwidth I have available and how it's being used. Recommendations on a quick tool to assess both close and far memory bandwidth? (This is an AMD EPYC, in case it matters.)
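
In the meantime, a rough triad-style sketch (nowhere near as careful as STREAM or likwid-bench, which I suppose are the standard answers) at least gives a ballpark for close vs far memory: pin it to one socket and switch the memory node with numactl, e.g. numactl --cpunodebind=0 --membind=0 ./bw for local memory and numactl --cpunodebind=0 --membind=1 ./bw for remote.

    // bw.cpp — build with: g++ -O2 -fopenmp bw.cpp -o bw
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t N = size_t(1) << 27;            // 128M doubles ≈ 1 GiB per array
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
        const double s = 3.0;

        double best = 0.0;
        for (int rep = 0; rep < 10; ++rep) {
            const double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (size_t i = 0; i < N; ++i)
                a[i] = b[i] + s * c[i];              // STREAM-triad-like kernel
            const double dt = omp_get_wtime() - t0;
            const double gbs = 3.0 * N * sizeof(double) / dt / 1e9;  // read b,c + write a
            if (gbs > best) best = gbs;
        }
        std::printf("best triad bandwidth: %.1f GB/s\n", best);
        return 0;
    }

(Without numactl the placement follows first-touch, and the arrays here are initialized by a single thread, so the numbers would mostly reflect one node anyway.)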

giuseppebilotta,

@hattom @jannem
thanks for the hint. Luckily this is OpenMP on bare metal so I have as much control as I want. I hadn't thought about using LIKWID for pinning too (I only used it to measure the CPU frequency during the program runs). At the moment I'm playing around with OMP_PLACES, which in combination with OMP_PROC_BIND has some unexpected behavior at times, so making sure I test what I really want to test is … less trivial than I expected.

giuseppebilotta,

@hattom @jannem

at the moment I'm more interested in producing reliable benchmark results for an article I'm writing; optimizations will come later. But at least for the benchmarks I have to make sure I'm measuring what I think I'm measuring 8-) At least I think I've found the right places+bind incantation to make sure the threads go on the cores I expect them to end up on … hopefully this will give less unpredictable results.

giuseppebilotta,

Biggest thanks to all who replied, the #Fediverse never ceases to amaze me with its friendliness and helpfulness. I've now started running the new batch of tests. I'll be posting updates for the curious. Let's see if we can scale to at least 32 threads without issues, maybe push it to 64. I expect things to go really sour the moment we cross the socket barrier.

giuseppebilotta,

@hattom it's a dual-socket AMD EPYC 7713, with 2 NUMA nodes, one per socket. I'm aware that with the CCX structure the situation in each socket is a bit more complicated than that, but I expect this to only have a lower-order impact on the whole thing.
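
For completeness, a minimal libnuma sketch (not GPUSPH code) to double-check the CPU→node mapping before trusting any OMP_PLACES list:

    // numa_map.cpp — build with: g++ -O2 numa_map.cpp -o numa_map -lnuma
    #include <numa.h>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "NUMA not available on this system\n");
            return 1;
        }
        const int ncpus = numa_num_configured_cpus();
        for (int cpu = 0; cpu < ncpus; ++cpu) {
            // numa_node_of_cpu() returns -1 for CPUs that are offline or unknown
            std::printf("cpu %3d -> node %d\n", cpu, numa_node_of_cpu(cpu));
        }
        return 0;
    }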

giuseppebilotta,

@hattom thanks. As I mentioned, at the moment I'm mostly interested in (properly measured) performance numbers. After this round of benchmarks is done and published, it'll be time to look into optimizations, and details like these will be very useful.

ProjectPhysX, to linux

Software should always "just work". To make compiling #FluidX3D easier, I made the compile script smarter: it now automatically detects the operating system (#Linux / #macOS / #Android), #X11 support on Linux, and whether GNU make is installed. 🖖🧐
https://github.com/ProjectPhysX/FluidX3D/commit/f990dfbe3f7a922d1cb6523e8e0b8e6d6cf8c905

giuseppebilotta,

@ProjectPhysX that's an interesting approach to the build process. I thought we were being “effective” by only requiring (GNU) make, but I see that you don't even require that.

(BTW, if you only did the uname check when the variable isn't already set by the user, you could let users override the target before running the script, e.g. with target=Linux ./make.sh, to avoid building X11 support even when it would otherwise be detected.)

giuseppebilotta, to lemmy

Does anybody know of communities/magazines centered around any of these topics:

[topic links not preserved in this copy]

The closest I can find is @fluidmechanics

giuseppebilotta, to random

Of course the code performs very differently when compiled with GCC vs when compiled with Clang … Wanna guess I'm going to have to run these tests with both compilers? (On the one hand, more material for the manuscript; on the other, several more days of simulation …)

giuseppebilotta, to hpc

Kind of surprised there is no .hpc gTLD

#HPC #Internet

beatnikprof, to academia
@beatnikprof@mas.to

Student: How can I improve my participation score?
Me: Participate.
🤦‍♂️
#professor #academia #academicchatter #academicmastodon

giuseppebilotta,

@Zitzero @beatnikprof
it also kind of depends on the professor: not all the ones who claim to want more participation actually appreciate it, or not all in the same form. But e.g. if the students are struggling with the material (or the way it's presented) and nobody speaks up, how will the professor know? (Sometimes the vacant look in the students' eyes helps, but …)

giuseppebilotta, to random

Well, this is Not Good™: scaling one of the test cases up in size leads to random segfaults due to NaNs appearing out of nowhere during integration. And it seems to be hardware-related, since the exact same test case produces the expected result everywhere else.

If I have to RMA this machine I'm going to scream.

giuseppebilotta,

Debugging this NaN is interesting because I'm finding a lot of kinks that are easy to smooth out, but unrelated to the actual problem.
It's also frustrating because getting the data needed for the debugging takes FOREVER.
OTOH it's nice to leverage the debugging infrastructure I built in GPUSPH.

giuseppebilotta,

Oh wow I got it. There's an integer overflow somewhere.

giuseppebilotta,

This was a fascinating bug to track down. The key issue is that even if the number of particles is limited to UINT_MAX, the neighbors list needs larger-than-uint indexing. This isn't new, and was already taken into account in the code by having a specific 64-bit type used to index the neighbors list. However, in a couple of places in the guts of the code this was forgotten during the transition, and that's what was tripping in this case.
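
An illustrative sketch of the failure mode (names and numbers are mine, not the actual GPUSPH code): the product of two 32-bit quantities has to be widened before the multiplication, not after.

    // overflow.cpp — build with: g++ -O2 overflow.cpp -o overflow
    // (building with clang++ -fsanitize=integer makes the wrapping
    //  multiplication report at runtime, which is how the remaining
    //  offenders were found)
    #include <cstdint>
    #include <cstdio>

    int main() {
        const uint32_t num_particles = 20'000'000;  // fits comfortably in a uint
        const uint32_t max_neibs     = 320;         // the enlarged list used in these tests

        // Wrong: the product is computed in 32 bits and wraps modulo 2^32
        // (20e6 * 320 > 2^32) before the conversion to the 64-bit index type.
        const uint64_t bad  = uint64_t(num_particles * max_neibs);

        // Right: widen one operand first, so the multiplication itself is 64-bit.
        const uint64_t good = uint64_t(num_particles) * max_neibs;

        std::printf("wrapped: %llu, correct: %llu\n",
            (unsigned long long)bad, (unsigned long long)good);
        return 0;
    }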

giuseppebilotta,

So why didn't we spot it until now? After all, even with the standard neighbors list size of at most 128 elements per particle, it should trip whenever there are more than ~16M particles per device. The thing is, to actually trip the bug in a way that doesn't self-compensate you need a combination of specific geometries and a specific way of indexing the neighbors list.

giuseppebilotta,

To improve performance, the neighbors list is stored differently on the GPU, which uses an interleaved layout (first neighbor of all particles, followed by second neighbor of all particles, etc.), whereas the CPU uses the more classic layout of all neighbors of the first particle followed by all neighbors of the second particle, etc. (think row- vs column-major for 2D arrays).
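
In code terms, roughly this (names are mine, not GPUSPH's; in both cases the flat offset has to be computed in the 64-bit index type):

    // layouts.cpp — build with: g++ -O2 layouts.cpp -o layouts
    #include <cstdint>
    #include <cstdio>

    using neib_idx_t = uint64_t;   // dedicated wide type for neighbors-list indexing

    // GPU: interleaved ("column-major"): neighbor k of every particle is
    // contiguous, which keeps the accesses of adjacent threads coalesced.
    neib_idx_t gpu_index(uint32_t particle, uint32_t k, uint32_t num_particles) {
        return neib_idx_t(k) * num_particles + particle;
    }

    // CPU: linear ("row-major"): all neighbors of one particle are contiguous,
    // which is friendlier to per-core caches and prefetching.
    neib_idx_t cpu_index(uint32_t particle, uint32_t k, uint32_t max_neibs) {
        return neib_idx_t(particle) * max_neibs + k;
    }

    int main() {
        const uint32_t num_particles = 20'000'000, max_neibs = 320;
        // the same (particle, neighbor) entry sits at very different offsets
        std::printf("GPU offset: %llu\n", (unsigned long long)gpu_index(1234, 300, num_particles));
        std::printf("CPU offset: %llu\n", (unsigned long long)cpu_index(1234, 300, max_neibs));
        return 0;
    }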

giuseppebilotta,

This particular set of tests had the combination of all the elements needed to trip it: a large number of particles, a larger-than-normal neighbors list (320 neighbors per particle), and use of the CPU backend.

On the upside, the CPU backend usage was also what made debugging easier (for appropriate definitions thereof), especially once the reason for the bug was clear: building with clang -fsanitize=integer and finding the places where this caused issues.

giuseppebilotta,

Interestingly, there were a couple more places where the integer UB was triggered (esp. concerning the choice of signed vs unsigned). So I took the opportunity to iron those out too.

ProjectPhysX, to github German

#FluidX3D has passed 2000 stars! It is the most popular #CFD software on #GitHub now! 🖖😊⭐️
https://github.com/ProjectPhysX/FluidX3D
Feeling blessed that my work is useful to so many people across the globe, with users in 75 countries already! 🌍
42% EU, 30% Americas, 25% Asia, 3% Oceania+Africa

giuseppebilotta,

@ProjectPhysX that's almost a vertical line, congratulations! (How do you get the history of stars?)

giuseppebilotta,

@ProjectPhysX oh nice, thanks.
