Message boards : Number crunching : Rosetta x86 on AMD CPU
Author | Message |
---|---|
Keith Myers Joined: 29 Mar 20 Posts: 97 Credit: 332,619 RAC: 298 |
OK, sorry, I didn't have a chance to look up the specs. This WU failed with a segfault, in the manner that shimmerfairy described. https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650 These work units fail a little over 6 hours into their intended 8 hour run. Nothing in the stderr.txt file indicates why. https://boinc.bakerlab.org/rosetta/result.php?resultid=1136152718 https://boinc.bakerlab.org/rosetta/result.php?resultid=1136159909 https://boinc.bakerlab.org/rosetta/result.php?resultid=1136160035 https://boinc.bakerlab.org/rosetta/result.php?resultid=1136161098 https://boinc.bakerlab.org/rosetta/result.php?resultid=1136152216 That should be enough for you to look at. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
OK, so signal 11 is the reported problem, in non-COVID tasks: some fail rather immediately, others after many hours. At this point, I think it best to wait for the new application version and see what any new symptoms may look like. The new version will have a number of issues addressed, but I don't have further detail to be more specific. The fact that they don't seem to report any completed models implies each task was still on the first model at the time of the failure. Please see Admin bcov's post about the new application and creation of work units. Rosetta Moderator: Mod.Sense |
Keith Myers Joined: 29 Mar 20 Posts: 97 Credit: 332,619 RAC: 298 |
OK, I think I will just go ahead and dump the work I was sent. I was sent WAY too much on the first scheduler connection, more than I can possibly finish before the deadline. All the currently running tasks are in EDF mode. I was going to just let the excess expire naturally and be resent. But if there is no likely chance the majority of the work will properly complete and award credit, I might as well wait for the new applications that parse the CPU features correctly. |
William Albert Joined: 22 Mar 20 Posts: 23 Credit: 1,069,070 RAC: 99 |
The issue that shimmerfairy described is a compatibility issue specific to a missing SSSE3 instruction on AMD K10 CPUs, and wouldn't occur on your Ryzen machine. Your work units are seg faulting. While in the case of K10, the cause of the seg fault is an invalid CPU instruction, WUs can also seg fault as a result of malfunctioning hardware. If you haven't already done so, I would run some hardware stress tests to verify that the hardware is actually stable. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Ya, sorry, but that sounds like best course ATM. Rosetta Moderator: Mod.Sense |
Keith Myers Joined: 29 Mar 20 Posts: 97 Credit: 332,619 RAC: 298 |
I always run stress tests on all my machines to test for stability. Many hours of stressapptest for the memory and both cpu and memory for many hours in y-cruncher and Prime95. No issues found or errors detected. I run all my other projects just fine without errors on the host as well as all my other Ryzen hosts. |
Keith Myers Joined: 29 Mar 20 Posts: 97 Credit: 332,619 RAC: 298 |
"Your work units are seg faulting." How can you state that?? I have had only one failed work unit for a segfault. All the other errors show no reason for the error. |
William Albert Joined: 22 Mar 20 Posts: 23 Credit: 1,069,070 RAC: 99 |
Looking at one of your work units as an example: <core_client_version>7.17.0</core_client_version> <![CDATA[ <message> process got signal 11</message> <stderr_txt> command: <snipped for brevity> Starting watchdog... Watchdog active. </stderr_txt> ]]> The error is right near the top: <message> process got signal 11</message> A "Signal 11" is a seg fault. |
Keith Myers Joined: 29 Mar 20 Posts: 97 Credit: 332,619 RAC: 298 |
I was not aware a signal 11 is a segfault. I have never received that on any of my other projects. When those projects error on a segfault, they state so explicitly in the stderr.txt output. Just like the only Rosetta task I have had error with what I know to be a segfault. https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650 SIGSEGV: segmentation violation Stack trace (18 frames): [0xde75dcf] [0xf7f64b70] [0xd7cf60a] [0xc50e485] [0xc4f28d6] [0xc4fd52f] [0xc96437e] [0xc5f7215] [0xb265724] [0xb2a83b6] [0xb2af655] [0x8a87605] [0x8a88a7c] [0x8a4b3be] [0x8a555d0] [0x80548d3] [0xdf0cfd8] [0x8048131] Exiting... </stderr_txt> ]]> |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,992,337 RAC: 15,074 |
"Does Rosetta obey using a max_concurrent statement in an app_config? I am having out-of-memory issues preventing my GPU tasks from running, and I am not able to control it well just using the %cpu setting in Preferences." Yes, it works fine. I have been using it for ~1.5 months already, since huge COVID WUs started to pop up, hoarding RAM. But there is a little trick: R@H has 2 different application lines (rosetta and rosetta mini), so you need to set rules for both apps, or use the <project_max_concurrent> option instead of just max_concurrent to set the restriction at the whole-project level. Reference for anyone who does not know how to use app_config for such things: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13644&postid=93152#93152 |
Ivailo Bonev Joined: 9 May 07 Posts: 15 Credit: 4,298,558 RAC: 1,044 |
I think something is off with the CPU utilization by the new 4.12 app for the new Ryzen 3k systems. I have a 3800X, and with 4.07 I had consistent 79-80C temps under full load (CPU Package power 105-107W). Now, with the 4.12 app, temps are 65-70C under full load (CPU Package power 80-85W). What was changed? I was under the impression that the new app would be somewhat more optimized for the new CPUs. |
Keith Myers Joined: 29 Mar 20 Posts: 97 Credit: 332,619 RAC: 298 |
I'd like to find the post that explained the optimized changes in the new app. All I've seen is that the app is targeted at Covid-19. |
entity Joined: 8 May 18 Posts: 19 Credit: 5,968,181 RAC: 9,741 |
"I think something is off with the CPU utilization by the new 4.12 app for the new Ryzen 3k systems." We were seeing this same behavior over at WCG on the MIP project, which also uses Rosetta. The MIP developers posted this reply after we had sent them some data on what we were seeing (this was a couple of years ago): "The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well in to a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel on machines. Nothing seemed slower for us because we are always running in that regime. We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++ and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises. Long term, identifying these issues may end up improving Rosetta for everyone that uses it so pat yourselves on the back for that!" It's that sitting idle while waiting for data from main memory that causes the temps and energy use to drop. |
Ivailo Bonev Joined: 9 May 07 Posts: 15 Credit: 4,298,558 RAC: 1,044 |
"We were seeing this same behavior over at WCG on the MIP project which also uses Rosetta. The MIP developers posted this reply after we had sent them some data on what we were seeing (this was a couple of years ago):" Thank you for the answer and explanation. I now see the usual behavior from the CPU; maybe the data in the first batch with 4.12 was quite different and more "hungry for L3 cache". |
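
For readers following William Albert's point that "signal 11" is a seg fault: on Linux hosts, signal number 11 is SIGSEGV. Below is a minimal Python sketch for looking up a signal number by name. It assumes a Linux (or similar POSIX) host, since signal numbering can differ on other platforms, and the helper name `describe_signal` is hypothetical, not part of BOINC or Rosetta.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number on this host."""
    try:
        # signal.Signals is an enum; looking up an unknown value raises ValueError.
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


# The number BOINC reported in "process got signal 11".
print(describe_signal(11))
```

On a Linux host this should print the same name that appears in the explicit stack-trace output Keith Myers quoted (SIGSEGV), which is why the two kinds of error report describe the same failure.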
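
William Albert's post attributes the K10 failures to a missing SSSE3 instruction set. On an x86 Linux host, one way to check whether a CPU advertises SSSE3 is to look for the `ssse3` entry on the `flags` line of `/proc/cpuinfo`. The sketch below is a rough illustration only: the function name `has_cpu_flag` is hypothetical, and the two sample strings are made-up excerpts for demonstration, not real dumps from any host in this thread.

```python
def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` appears on the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Flags are whitespace-separated tokens; match whole tokens only,
            # so "sse3"-style substrings do not count as "ssse3".
            return flag in value.split()
    return False


# Illustrative (made-up) excerpts: an older CPU without SSSE3, a newer one with it.
SAMPLE_WITHOUT_SSSE3 = "flags\t\t: fpu sse sse2 pni sse4a\n"
SAMPLE_WITH_SSSE3 = "flags\t\t: fpu sse sse2 pni ssse3 sse4_1 sse4_2 avx avx2\n"

print(has_cpu_flag(SAMPLE_WITHOUT_SSSE3, "ssse3"))  # False
print(has_cpu_flag(SAMPLE_WITH_SSSE3, "ssse3"))     # True
```

On a real machine the text would come from something like `open("/proc/cpuinfo").read()`. Note that `/proc/cpuinfo` lists SSE3 as `pni`, so a CPU can show `pni` without `ssse3`.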
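
Mad_Max's advice about limiting concurrent Rosetta tasks can be sketched as a small Python script that generates an `app_config.xml`. This is a hedged illustration, not a tested configuration: the element names (`app_config`, `app`, `name`, `max_concurrent`, `project_max_concurrent`) follow the BOINC client's app_config format, but the application names `rosetta` and `minirosetta` and the limits used here are assumptions, so check the actual app names your client reports before using anything like this. The function name `build_app_config` is hypothetical, and `ET.indent` needs Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_app_config(per_app_limits: Dict[str, int],
                     project_limit: Optional[int] = None) -> str:
    """Build app_config.xml text with per-app and/or project-wide task limits."""
    root = ET.Element("app_config")
    for app_name, limit in per_app_limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(limit)
    if project_limit is not None:
        # One cap across every application of the project.
        ET.SubElement(root, "project_max_concurrent").text = str(project_limit)
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# Assumed app names and example limits; adjust to your own host.
print(build_app_config({"rosetta": 4, "minirosetta": 4}, project_limit=6))
```

The output would be saved as `app_config.xml` in the Rosetta project directory under the BOINC data directory, after which the client needs to re-read its config files. Setting only `project_limit` corresponds to the simpler project-wide option Mad_Max mentions.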
©2024 University of Washington
https://www.bakerlab.org