giuseppebilotta

@giuseppebilotta@fediscience.org

Researcher @ INGV-OE. Opinions my own.

giuseppebilotta, to random

Well, this is Not Good™: scaling one of the test cases up in size leads to random segfaults, due to NaNs appearing out of nowhere during integration. And it seems to be hardware-related, since the exact same test case produces the expected result everywhere else.

If I have to RMA this machine I'm going to scream.

giuseppebilotta,

A “quick” memtest seems to run correctly, although I think I'll have to run a deeper one to be sure.

giuseppebilotta,

Of course, since the problem only happens with large test cases, running with a single thread to make debugging easier means it also takes FOREVER to hit the snag, even though it's actually pretty early in the simulation.

In the meantime I'm sitting here wondering if this will turn out to be a NUMA issue, while hoping I just did something silly with the code, even though this error never happens anywhere else and even lowering the thread count on this machine doesn't help.

giuseppebilotta,

OK, happens with the OpenMP build even with a single thread. Let's see without OpenMP.

giuseppebilotta,

Same error. Well, this at the very least seems debuggable.

giuseppebilotta,

Oh, this is getting interesting: this might be compiler shenanigans too. So: somewhere the code is producing NaNs that shouldn't be there. I have a kernel that checks for NaNs after integration. For each particle it returns early if the position and velocity are finite; otherwise the code goes on to print the information about the particle, and a large part of this code, including the printf, seems to be just optimized out. WTH?

giuseppebilotta, (edited)

EDIT: SOLVED.

(Was: I have a computational kernel (C++ functor) that checks if any value in a given set of arrays is NaN. If all the values are finite the functor returns (and the next value can be checked by the caller); otherwise it prints some data, updates a counter and then exits. The counter gets updated correctly, a throw in the code gets called, but the printing of the data seems to be optimized out.)
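For illustration, a minimal sketch of a functor along those lines (the names `ParticleData` and `CheckFinite`, and the per-particle struct layout, are made up here; this is not the actual GPUSPH code):

```cpp
#include <cmath>
#include <cstdio>
#include <stdexcept>

// Hypothetical per-particle data, for illustration only.
struct ParticleData {
    float px, py, pz;   // position
    float vx, vy, vz;   // velocity
};

// Sketch of the check: return early if everything is finite, otherwise
// report the offending particle, bump a counter and bail out.
struct CheckFinite {
    unsigned long &error_count;

    void operator()(const ParticleData &p, size_t index) const {
        if (std::isfinite(p.px) && std::isfinite(p.py) && std::isfinite(p.pz) &&
            std::isfinite(p.vx) && std::isfinite(p.vy) && std::isfinite(p.vz))
            return; // all good, the caller moves on to the next particle

        // This is the part that appeared to be optimized out.
        std::printf("non-finite particle %zu: pos (%g %g %g) vel (%g %g %g)\n",
                    index, p.px, p.py, p.pz, p.vx, p.vy, p.vz);
        ++error_count;
        throw std::runtime_error("non-finite particle data");
    }
};
```

(One thing worth noting with this kind of check: under fast-math style optimization flags the compiler is allowed to assume NaNs never occur, which can make finiteness tests, and the code guarded by them, fold away.)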

giuseppebilotta,

OK, solved: this was actually an issue with the code, a fence for very old CUDA architectures was being triggered by the CPU backend. Phew. Now onwards to debugging the actual issue.
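I'm not showing the exact fence in question, but as a general illustration of how a guard meant for old CUDA architectures can leak into a CPU build: `__CUDA_ARCH__` is only defined when compiling device code, and an undefined identifier evaluates to 0 in a preprocessor `#if`, so a bare architecture check is also taken by a pure host build. A self-contained sketch (the macro names are made up):

```cpp
#include <cstdio>

// Naive guard: with __CUDA_ARCH__ undefined (host/CPU compilation),
// "__CUDA_ARCH__ < 200" becomes "0 < 200", which is true, so the legacy
// path intended for very old GPUs gets enabled on the CPU backend too.
#if __CUDA_ARCH__ < 200
#define LEGACY_GPU_WORKAROUND 1
#else
#define LEGACY_GPU_WORKAROUND 0
#endif

// Safer guard: restrict the workaround to old *device* builds only.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 200
#define LEGACY_GPU_WORKAROUND_FIXED 1
#else
#define LEGACY_GPU_WORKAROUND_FIXED 0
#endif

int main() {
    // A plain host compile prints "naive: 1, fixed: 0".
    std::printf("naive: %d, fixed: %d\n",
                LEGACY_GPU_WORKAROUND, LEGACY_GPU_WORKAROUND_FIXED);
    return 0;
}
```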

giuseppebilotta, to random

One of the reviewers for the manuscript on introducing CPU support in GPUSPH asked for scalability tests on more than 8 cores (when I originally wrote the whole thing, the only decent CPU I had at hand was an AMD Ryzen 7 3700X 8-core processor). It's a reasonable request, so I've been running tests on the new server we got at INGV-OE, which sports two AMD EPYC 7713 64-core processors. The most interesting finding so far has been that GPUSPH does seem to scale decently, but the baseline is lower.

giuseppebilotta,

I never really expected the “quick hack” I did to run on CPU to be “state of the art” by any meaning of the word, and I was actually pretty surprised myself by how good the results were, with relatively low error (it confirmed my idea that good GPU designs are also good for multi-core CPUs, though). I most definitely don't expect it to scale optimally on a NUMA system with 64 cores per node (not even counting SMT here). What surprises me here is the single-core performance.

giuseppebilotta,

@jannem I thought as much, but my preliminary tests don't really add up: the EPYC seemed to be at 2GHz, but now I'm seeing it go to 3.something under heavier load, so it can boost to close to the frequency I saw on the Ryzen (3.6GHz pretty consistently).
So now I'm wondering if I misread the numbers when I started writing this thread.
I'm now running a new set of tests, more detailed and with more load; let's see how things go …

giuseppebilotta,

This is getting interesting. As suggested by @jannem https://fosstodon.org/@jannem/111818938008604521 the EPYC has a lower base clock, so having a lower baseline isn't strange. Now I'm redoing the tests, collecting more data and ensuring that the setup is better determined, and it turns out that, at least in the new configuration and when running at low core counts, the EPYC boosts past the 2GHz baseline up to 3.7GHz, which is what the Ryzen was running at.

giuseppebilotta,

And indeed, now that I've cleaned up the test setup, the single-thread baseline for my scaling tests gives roughly the same performance as the Ryzen. My prediction now is that the scaling tests will show good results up to, say, 16 or 32 threads, and then start tapering off, due to the shift from boosted to baseline operating frequency (a 46% drop, from 3.7GHz down to 2GHz). Spreading the load across the sockets will help keep the scaling good, but the question now is “how much”.
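As a rough back-of-the-envelope check of that prediction (my own sketch, using only the clock figures above: 3.7GHz boosted, 2GHz at full load), the achievable speedup at n threads is capped by n times the clock ratio, so even with otherwise perfect scaling the curve has to bend once the cores drop to base clock:

```cpp
#include <cstdio>

// Crude clock-capped speedup model: assume otherwise ideal parallel scaling,
// full boost up to some thread count (16 here, just a guess), base clock beyond.
int main() {
    const double boosted = 3.7; // GHz, seen at low thread counts
    const double base    = 2.0; // GHz, sustained at full load
    for (int n : {1, 8, 16, 32, 64, 128}) {
        double clock = (n <= 16) ? boosted : base;
        std::printf("n = %3d  clock-capped speedup <= %5.1f\n",
                    n, n * clock / boosted);
    }
    return 0;
}
```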

giuseppebilotta,

I've already completed an initial set of tests with 128 OpenMP threads and different binding configurations, and the “spread” setting consistently provides the best performance. So it would seem that paying the NUMA cost is worth it as long as it allows the cores to boost higher.

We'll see how this holds at the high end of the scaling tests, and even more so with SMT.
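Assuming the “spread” setting above refers to the standard OpenMP binding policy (`OMP_PROC_BIND=spread` together with `OMP_PLACES`), here's a tiny stand-alone program (not part of GPUSPH) that reports where each thread ends up, which makes it easy to compare binding configurations:

```cpp
#include <cstdio>
#include <omp.h>

// Report which OpenMP "place" each thread is bound to. Try e.g.:
//   OMP_PLACES=cores OMP_PROC_BIND=spread  ./binding_report
//   OMP_PLACES=cores OMP_PROC_BIND=close   ./binding_report
// and compare the distribution across the sockets.
int main() {
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("thread %3d of %3d -> place %3d of %3d\n",
                    omp_get_thread_num(), omp_get_num_threads(),
                    omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}
```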

giuseppebilotta,

Unrelated (actually, related): does anybody know of a (Linux, command-line) tool that allows you to monitor CPU frequency during a program's execution? I'm assuming perf can probably be made to do something like that, but if anyone has a ready-made recipe (with that or any other tool) it would be greatly appreciated.
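(In the absence of a dedicated tool, a crude fallback is to poll the kernel's cpufreq interface while the program runs; the sketch below assumes a Linux system exposing /sys/devices/system/cpu/cpuN/cpufreq/scaling_cur_freq, and the CPU count and sampling interval are just placeholders.)

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Periodically print the current frequency reported by cpufreq for the first
// few CPUs. Run it alongside the program under test; stop with Ctrl-C.
int main() {
    const int num_cpus = 4;                       // adjust to taste
    const auto interval = std::chrono::seconds(1);

    while (true) {
        for (int cpu = 0; cpu < num_cpus; ++cpu) {
            std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                            "/cpufreq/scaling_cur_freq");
            long khz = 0;
            if (f >> khz)
                std::cout << "cpu" << cpu << ": " << khz / 1000 << " MHz  ";
        }
        std::cout << '\n' << std::flush;
        std::this_thread::sleep_for(interval);
    }
}
```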

giuseppebilotta,

@AlanSill thank you! This seems to be exactly the tool I was looking for!

giuseppebilotta,

Yeah, I think I'll have to plug CPU clock frequency changes into my scaling analysis, because the effect over the whole range of cores is huge, much higher than the NUMA cost. 3.6GHz vs 3.1GHz for 16 vs 8+8 threads (a ratio of about 1.16) is nearly the same as the 340 to 290 seconds runtime ratio (about 1.17). The clock scaling accounts for essentially all of the damage here.

giuseppebilotta, to random

Part of my job, both as a researcher and as a professor, is to evaluate people. It's probably the part I hate the most, after all the bureaucracy. It's horrible because I know perfectly well what it's like on the other side, and I'm aware that there are lots of incidental situations that can affect an evaluation negatively. I wish I could turn every evaluation into a cooperation, with an opportunity for reciprocal growth, but ultimately I still have to express my judgement.

giuseppebilotta,

Part of the issue stems from the widespread (and misplaced) attribution of personal value to such evaluations, which leads to rather problematic situations: for example, student projects that are interesting and very well done, but that fail to address the specific topic of the subject I'm teaching, leaving me with little or no material to assess the student's knowledge of the one thing it is not only my duty to evaluate, but also what I'm most strictly qualified to evaluate.

giuseppebilotta,

I've made a habit of warning my students, repeatedly and in advance, about their ambitious proposals for term projects, encouraging them to pursue those interesting ideas on their own, but cautioning them that they aren't the kind of projects I can use for evaluation. This at least reduces, if not avoids, the unpleasant situations where this same discussion has to happen at project delivery.

giuseppebilotta,

The situation is possibly worse when evaluating applicants for post-doc or similar positions, especially when trivial mistakes (like forgetting an attachment when submitting the application) can have catastrophic effects. The evaluation committee can patch over some of the mistakes (and rest assured, they will if they can), but sometimes the situation is simply unrecoverable.

giuseppebilotta,

There are ethical reasons why I, to put it mildly, dislike this kind of gatekeeping role (believe me, I'd rather just teach students interesting things, use term projects exclusively as an opportunity to practice and learn more, and guide people towards interesting research projects), but to make things worse, we can end up in court if anybody challenges the results of our evaluations.

giuseppebilotta,

This means that even if we want to (and can) be as positive as possible in our evaluation, it still needs to be “documentable”: we can't just make stuff up or say that “we knew this particular candidate had potential”. We need some kind of paper trail, and even the feeblest one is better than none.

giuseppebilotta,

(This rant was prompted by a candidate forgetting to attach their CV when submitting their application for a position here. A clerical error with catastrophic consequences.)

giuseppebilotta, to berlin

The #SPHERIC2024 abstract submission deadline has been extended to Friday, January 26th. If you or someone you know has a result that may be of interest to the #SmoothedParticleHydrodynamics community, consider submitting it to the upcoming #SPHERIC International #Workshop in #Berlin.
More information at
https://www.spheric2024.com
https://www.dive-solutions.de/spheric2024
