Rosetta x86 on AMD CPU

Keith Myers
Joined: 29 Mar 20
Posts: 97
Credit: 332,619
RAC: 298
Message 92933 - Posted: 1 Apr 2020, 15:41:00 UTC - in response to Message 92930.  

OK, sorry, I didn't have chance to look up the specs.

Please point out the WUs you described where you felt they had problems but showed no errors.

This WU failed in the manner that shimmerfairy described with a segfault.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650

These work units fail at a little over 6 hours of their intended 8 hour run. Nothing in the stderr.txt file to indicate why.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136152718
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136159909
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136160035
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136161098
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136152216

That should be enough for you to look at.
ID: 92933
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92944 - Posted: 1 Apr 2020, 16:04:04 UTC

OK, so signal 11 being the reported problem, in non-COVID tasks. Some rather immediately, others after many hours.

At this point, I think it best to wait for the new application version and see what any new symptoms may look like. The new version will have a number of issues addressed, but I don't have further detail to be more specific.

The fact that they don't seem to report any completed models implies it was still on the first model at the time of the failure.

Please see Admin bcov's post about the new application and the creation of work units.
Rosetta Moderator: Mod.Sense
ID: 92944
Keith Myers
Joined: 29 Mar 20
Posts: 97
Credit: 332,619
RAC: 298
Message 92948 - Posted: 1 Apr 2020, 16:14:34 UTC - in response to Message 92944.  

OK, I think I will just go ahead and dump the work I was sent. I was sent WAY too much on the first scheduler connection, more than I can possibly finish before the deadline. All the currently running tasks are in EDF mode. I was going to just let the excess expire naturally and be resent.

But if there is little chance that the majority of the work will complete properly and award credit, I might as well wait for the new applications that fix the CPU feature parsing.
ID: 92948
William Albert

Joined: 22 Mar 20
Posts: 23
Credit: 1,069,070
RAC: 99
Message 92950 - Posted: 1 Apr 2020, 16:22:33 UTC - in response to Message 92933.  


This WU failed in the manner that shimmerfairy described with a segfault.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650


The issue that shimmerfairy described is a compatibility issue specific to a missing SSSE3 instruction on AMD K10 CPUs, and wouldn't occur on your Ryzen machine.

Your work units are seg faulting. While in the case of K10, the cause of the seg fault is an invalid CPU instruction, WUs can also seg fault as a result of malfunctioning hardware.

If you haven't already done so, I would run some hardware stress tests to verify that the hardware is actually stable.
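
As a rough illustration of the compatibility point above: K10 cores report SSE3 and SSE4a but not SSSE3, while Ryzen reports SSSE3. On Linux this can be checked from the `flags` line of `/proc/cpuinfo`, where SSE3 appears as `pni` and SSSE3 as `ssse3`. The sketch below is only an assumption about how one might check a host by hand; the two sample strings are abbreviated, made-up flag lines, not real dumps, and this is not how the Rosetta application itself detects CPU features.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" entry is enough; every core lists the same set.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_ssse3(cpuinfo_text):
    """True if the flags line advertises SSSE3 (not the same as SSE3/'pni')."""
    return "ssse3" in cpu_flags(cpuinfo_text)


# Hypothetical, shortened flag lines for illustration only.
k10_like = "flags\t\t: fpu sse sse2 pni sse4a"
ryzen_like = "flags\t\t: fpu sse sse2 pni ssse3 sse4_1 sse4_2 avx avx2"

print("K10-like host has SSSE3:", has_ssse3(k10_like))
print("Ryzen-like host has SSSE3:", has_ssse3(ryzen_like))
```

To check a real Linux host, pass the contents of `/proc/cpuinfo` to `has_ssse3` instead of the sample strings.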
ID: 92950
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92951 - Posted: 1 Apr 2020, 16:23:27 UTC - in response to Message 92948.  

Ya, sorry, but that sounds like best course ATM.
Rosetta Moderator: Mod.Sense
ID: 92951
Keith Myers
Joined: 29 Mar 20
Posts: 97
Credit: 332,619
RAC: 298
Message 92958 - Posted: 1 Apr 2020, 17:34:00 UTC - in response to Message 92950.  
Last modified: 1 Apr 2020, 17:36:32 UTC


This WU failed in the manner that shimmerfairy described with a segfault.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650


The issue that shimmerfairy described is a compatibility issue specific to a missing SSSE3 instruction on AMD K10 CPUs, and wouldn't occur on your Ryzen machine.

Your work units are seg faulting. While in the case of K10, the cause of the seg fault is an invalid CPU instruction, WUs can also seg fault as a result of malfunctioning hardware.

If you haven't already done so, I would run some hardware stress tests to verify that the hardware is actually stable.

I always run stress tests on all my machines to test for stability: many hours of stressapptest for the memory, and many hours of y-cruncher and Prime95 for both CPU and memory.

No issues found or errors detected. I run all my other projects without errors on this host, as well as on all my other Ryzen hosts.
ID: 92958
Keith Myers
Joined: 29 Mar 20
Posts: 97
Credit: 332,619
RAC: 298
Message 92959 - Posted: 1 Apr 2020, 17:35:34 UTC - in response to Message 92950.  

Your work units are seg faulting.

How can you state that?? I have had only one failed work unit for a segfault.

All the other errors show no reason for the error.
ID: 92959
William Albert

Joined: 22 Mar 20
Posts: 23
Credit: 1,069,070
RAC: 99
Message 92961 - Posted: 1 Apr 2020, 17:51:48 UTC - in response to Message 92959.  
Last modified: 1 Apr 2020, 17:52:02 UTC

Looking at one of your work units as an example:

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: <snipped for brevity>
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>


The error is right near the top:

<message>
process got signal 11</message>


A "Signal 11" is a seg fault.
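
For anyone who wants to decode such messages themselves, here is a minimal sketch that pulls the signal number out of a `process got signal N` message and looks up its name with Python's standard `signal` module. The helper name `describe_signal` is made up for this example; the mapping of 11 to `SIGSEGV` is what common platforms such as Linux use.

```python
import re
import signal


def describe_signal(stderr_text):
    """Return (number, name) for a 'process got signal N' message, or None."""
    match = re.search(r"process got signal (\d+)", stderr_text)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        # signal.Signals maps a platform signal number to its symbolic name.
        name = signal.Signals(number).name
    except ValueError:
        name = "unknown"
    return number, name


sample = "<message>\nprocess got signal 11</message>"
print(describe_signal(sample))
```

The same lookup shows why a task can segfault without the words "SIGSEGV" ever appearing in stderr.txt: the client only records the numeric signal that ended the process.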
ID: 92961
Keith Myers
Joined: 29 Mar 20
Posts: 97
Credit: 332,619
RAC: 298
Message 92967 - Posted: 1 Apr 2020, 18:30:28 UTC - in response to Message 92961.  

I was not aware that a signal 11 is a segfault. I have never received that on any of my other projects. When those projects error out on a segfault, they state so explicitly in the stderr.txt output, just like the only Rosetta task I have had fail with what I know to be a segfault.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650
SIGSEGV: segmentation violation
Stack trace (18 frames):
[0xde75dcf]
[0xf7f64b70]
[0xd7cf60a]
[0xc50e485]
[0xc4f28d6]
[0xc4fd52f]
[0xc96437e]
[0xc5f7215]
[0xb265724]
[0xb2a83b6]
[0xb2af655]
[0x8a87605]
[0x8a88a7c]
[0x8a4b3be]
[0x8a555d0]
[0x80548d3]
[0xdf0cfd8]
[0x8048131]

Exiting...

</stderr_txt>
]]>
ID: 92967
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 15,074
Message 93156 - Posted: 3 Apr 2020, 4:19:57 UTC - in response to Message 92550.  
Last modified: 3 Apr 2020, 4:21:06 UTC

Does Rosetta obey a max_concurrent statement in an app_config? Out-of-memory problems are preventing my GPU tasks from running, and I am not able to control this well using just the %cpu setting in Preferences.

Yes, it works fine. I have been using it for ~1.5 months already, ever since huge COVID WUs started to pop up, hoarding RAM.

But there is a little trick: R@H has 2 different application lines (rosetta and rosetta mini), so you need to set rules for both apps, or use the <project_max_concurrent> option instead of just max_concurrent to set the restriction at the whole-project level.

Reference for anyone who does not know how to use app_config for such things: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13644&postid=93152#93152
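
A small sketch of the two approaches described above, written as a Python helper that emits the app_config.xml text. The app names `rosetta` and `minirosetta` and the limit of 4 are assumptions for illustration; check the `<app>` names in your own client_state.xml before using them. The helper `build_app_config` is hypothetical and not part of BOINC.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_limits, project_limit=None):
    """Build app_config.xml text with per-app and/or project-wide task limits."""
    root = ET.Element("app_config")
    for app_name, limit in app_limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(limit)
    if project_limit is not None:
        ET.SubElement(root, "project_max_concurrent").text = str(project_limit)
    return ET.tostring(root, encoding="unicode")


# Option 1: one rule per application (app names are assumed, see above).
per_app = build_app_config({"rosetta": 4, "minirosetta": 4})

# Option 2: a single limit for the whole project.
whole_project = build_app_config({}, project_limit=4)

print(per_app)
print(whole_project)
```

Note that option 1 limits each application separately, so both together could still run up to 8 tasks here, while option 2 caps the project total at 4. The resulting file goes in the project's directory under the BOINC data folder and is picked up after re-reading config files or restarting the client.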
ID: 93156
Ivailo Bonev

Joined: 9 May 07
Posts: 15
Credit: 4,298,558
RAC: 1,044
Message 93174 - Posted: 3 Apr 2020, 7:44:07 UTC
Last modified: 3 Apr 2020, 8:05:36 UTC

I think something is off with the CPU utilization of the new 4.12 app on the new Ryzen 3k systems.
I have a 3800X, and with 4.07 I had consistent 79-80C temps under full load (CPU Package power 105-107W); now, with the 4.12 app, temps are 65-70C under full load (CPU Package power 80-85W).
What was changed? I was under the impression that the new app would be somewhat more optimized for the new CPUs.
ID: 93174
Keith Myers
Joined: 29 Mar 20
Posts: 97
Credit: 332,619
RAC: 298
Message 93281 - Posted: 3 Apr 2020, 20:37:38 UTC

I'd like to find the post that explained the optimized changes in the new app. All I've seen is that the app is targeted at Covid-19.
ID: 93281
entity

Joined: 8 May 18
Posts: 19
Credit: 5,968,181
RAC: 9,741
Message 93286 - Posted: 3 Apr 2020, 21:27:19 UTC - in response to Message 93174.  

I think something is off with the CPU utilization of the new 4.12 app on the new Ryzen 3k systems.
I have a 3800X, and with 4.07 I had consistent 79-80C temps under full load (CPU Package power 105-107W); now, with the 4.12 app, temps are 65-70C under full load (CPU Package power 80-85W).
What was changed? I was under the impression that the new app would be somewhat more optimized for the new CPUs.

We were seeing this same behavior over at WCG on the MIP project which also uses Rosetta. The MIP developers posted this reply after we had sent them some data on what we were seeing (this was a couple of years ago):

"The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well in to a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel on machines. Nothing seemed slower for us because we are always running in that regime.
We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++ and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises.

Long term, identifying these issues may end up improving Rosetta for everyone that uses it so pat yourselves on the back for that!"

It's that sitting idle while waiting for data from main memory that causes the temps and energy use to drop.
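
To put rough numbers on that, here is a toy model of the fraction of time a core spends stalled on main memory. All figures are invented for illustration (they are not measurements of Rosetta or of any particular CPU), and the model ignores overlap between computation and memory access.

```python
def stall_fraction(work_cycles, misses, miss_penalty_cycles):
    """Fraction of total cycles spent waiting on main memory (toy model)."""
    stall_cycles = misses * miss_penalty_cycles
    total_cycles = work_cycles + stall_cycles
    return stall_cycles / total_cycles if total_cycles else 0.0


# Hypothetical: 1,000,000 cycles of useful work, 200-cycle miss penalty.
few_instances = stall_fraction(1_000_000, 1_000, 200)   # little cache contention
many_instances = stall_fraction(1_000_000, 5_000, 200)  # heavy cache contention

print(f"stalled with few misses:  {few_instances:.0%}")
print(f"stalled with many misses: {many_instances:.0%}")
```

In this sketch, a fivefold rise in L3 misses moves the core from mostly computing to idle half the time, which is consistent in direction with lower package power and temperature even though the core still shows as fully loaded.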
ID: 93286
Ivailo Bonev

Joined: 9 May 07
Posts: 15
Credit: 4,298,558
RAC: 1,044
Message 93352 - Posted: 4 Apr 2020, 6:19:02 UTC - in response to Message 93286.  

We were seeing this same behavior over at WCG on the MIP project which also uses Rosetta. The MIP developers posted this reply after we had sent them some data on what we were seeing (this was a couple of years ago):

"The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well in to a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel on machines. Nothing seemed slower for us because we are always running in that regime.
We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++ and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises.

Long term, identifying these issues may end up improving Rosetta for everyone that uses it so pat yourselves on the back for that!"

It's that sitting idle while waiting for data from main memory that causes the temps and energy use to drop.


Thank you for the answer and explanation. I now see the usual behavior from the CPU; maybe the data in the first batch with 4.12 was quite different and more "hungry for L3 cache".
ID: 93352



©2024 University of Washington
https://www.bakerlab.org