Today 3 wu's failed with Unhandled Exception Detected...

Message boards : Number crunching : Today 3 wu's failed with Unhandled Exception Detected...


alex

Joined: 21 Dec 14
Posts: 8
Credit: 2,669,706
RAC: 29
Message 90157 - Posted: 7 Jan 2019, 11:15:47 UTC

Hi,
normally Rosetta runs fine on my PCs, but today 3 WUs failed with very similar errors:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0127C0E8 write attempt to address 0x00ACE12C

Engaging BOINC Windows Runtime Debugger...
https://boinc.bakerlab.org/result.php?resultid=1050398575
https://boinc.bakerlab.org/result.php?resultid=1050397973
https://boinc.bakerlab.org/result.php?resultid=1050392563

Looking through my results, I found another PC with errors of the same kind. It looks like a system-wide problem, not a problem with one PC.
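
For anyone comparing these failures across several hosts, the "Reason:" line of the exception record can be picked apart mechanically. The Python below is a minimal sketch, not part of BOINC or Rosetta: `parse_reason` and `KNOWN_CODES` are hypothetical names, the pattern is fitted only to the record formats quoted in this thread, and the "read attempt" variant is an assumption. The two code names are the standard Windows NTSTATUS names.

```python
import re

# NTSTATUS values that appear in this thread, with their standard Windows names.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0x80000003: "STATUS_BREAKPOINT",
}

# Fitted to lines such as:
#   Reason: Access Violation (0xc0000005) at address 0x0127C0E8 write attempt to address 0x00ACE12C
# The trailing "read|write attempt" part is optional ("read" is an assumption).
REASON_RE = re.compile(
    r"Reason:\s*(?P<text>.+?)\s*\((?P<code>0x[0-9a-fA-F]+)\)\s*"
    r"at address (?P<addr>0x[0-9a-fA-F]+)"
    r"(?:\s+(?P<op>read|write) attempt to address (?P<target>0x[0-9a-fA-F]+))?"
)


def parse_reason(line):
    """Return the fields of a 'Reason:' line as a dict, or None if it does not match."""
    m = REASON_RE.search(line)
    if m is None:
        return None
    code = int(m.group("code"), 16)
    target = m.group("target")
    return {
        "text": m.group("text"),
        "code": code,
        "name": KNOWN_CODES.get(code, "unknown"),
        "address": int(m.group("addr"), 16),
        "operation": m.group("op"),  # None when no memory access is reported
        "target": int(target, 16) if target else None,
    }
```

Fed the line from the record above, this reports STATUS_ACCESS_VIOLATION with a write operation. Grouping results by `name` and `operation` is one way to check whether different hosts are really failing in the same manner.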
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 90159 - Posted: 7 Jan 2019, 14:08:53 UTC - in response to Message 90157.  

Hi,
normally Rosetta runs fine on my PCs, but today 3 WUs failed with very similar errors

You are running into the same problem I had when I tried my new Ryzen 2600 on Windows 10. It did terribly; about 7 out of 8 failed.
I have now switched it to Ubuntu 18.04.1, and the first one was OK. I will get a full load of results today (I run the 24-hour work units).

In general, my Ryzen 1700 and 2700 did well on Ubuntu also, when I updated to the latest Linux kernel.
I think they have some serious fixing to do with their Windows compiler, or whatever.
alex

Joined: 21 Dec 14
Posts: 8
Credit: 2,669,706
RAC: 29
Message 90160 - Posted: 7 Jan 2019, 16:05:44 UTC

It's not that clear. One Ryzen has 0 errors, another has 100%:
https://boinc.bakerlab.org/rosetta/results.php?hostid=3544884
https://boinc.bakerlab.org/rosetta/results.php?hostid=3258387

But yes, Intel is error free.
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 90161 - Posted: 7 Jan 2019, 18:49:20 UTC - in response to Message 90160.  

It's not that clear. One Ryzen has 0 errors, another has 100%

Chances are, it was some security update to Win 10 that is borking one machine. But the possibilities are endless.

Yes, Intel is error-free, on both Windows (at least Win7) and Ubuntu insofar as I have seen. But the output is inconsistent.
i7-4790 (Ubuntu 16.04.5): https://boinc.bakerlab.org/rosetta/results.php?hostid=3573441&offset=0&show_names=0&state=4&appid=
i7-8700 (Ubuntu 18.04.1): https://boinc.bakerlab.org/rosetta/results.php?hostid=3493841&offset=0&show_names=0&state=4&appid=
As you can see, the faster machine is doing worse. I have seen it do very well though, with around 1200 points per work unit, for a time.

And the Ryzen 2600 (Ubuntu 18.04.1) continues to do well: https://boinc.bakerlab.org/rosetta/results.php?hostid=3576251&offset=0&show_names=0&state=4&appid=

It is not even clear whether it is our machines, or their scoring functions. But the errors are another matter. The Rosetta people need to fix that, whatever it is.
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,624,867
RAC: 6,812
Message 90163 - Posted: 8 Jan 2019, 7:29:00 UTC - in response to Message 90161.  

The Rosetta people need to fix that, whatever it is.


Problems with Win10 + Rosetta are not new.
The 4.07 app was released in February 2018, so it seems they are in no hurry to debug it.
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 90177 - Posted: 9 Jan 2019, 21:39:48 UTC - in response to Message 90163.  
Last modified: 9 Jan 2019, 21:46:46 UTC

Problems with Win10 + Rosetta are not new.
The 4.07 app was released in February 2018, so it seems they are in no hurry to debug it.

It is curious that it works on some systems but not others.

I think another possible source of the problem is memory. It is well-known to be difficult to get reliable memory operation on the Ryzen motherboards. I have found over the past few days that even with Ubuntu, my Ryzen 2600 system will freeze up about once a day running Rosetta on all cores.

I can run that memory on WCG for months at a time with no problem. It seems that Rosetta brings out the worst in it. Reducing the memory speed from 2666 MHz to 2400 MHz may help. But I am ordering more memory that is actually on the QVL to make sure.

On Win 10, that might produce errors in the work units rather than freezes.
San-Fernando-Valley

Joined: 16 Mar 16
Posts: 12
Credit: 143,229
RAC: 0
Message 90214 - Posted: 16 Jan 2019, 8:46:11 UTC

NO, Intel is not error free!

After approx. 22 hours of crunching, 12 out of 29 work units on three different Intel rigs got the following error (or similar ones):


Name: RK190110-A_HT_DHD_59_B_HT_DHD_51.pdb-fnd_SAVE_ALL_OUT_711702_3297_0
Workunit: 947259516
Created: 14 Jan 2019, 14:36:48 UTC
Sent: 14 Jan 2019, 15:54:08 UTC
Report deadline: 22 Jan 2019, 15:54:08 UTC
Received: 15 Jan 2019, 14:46:28 UTC
Server state: Over
Outcome: Computation error
Client state: Cancelled by server
Exit status: 202 (0x000000CA) EXIT_ABORTED_BY_PROJECT
Computer ID: 3584822
Run time: 22 hours 51 min 13 sec
CPU time: 22 hours 50 min 17 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 5.30 GFLOPS
Application version: Rosetta Mini v3.78 windows_intelx86
Peak working set size: 301.50 MB
Peak swap size: 283.98 MB
Peak disk usage: 454.88 MB

....

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x75F1338D

Engaging BOINC Windows Runtime Debugger...


Intel HT off.
Plenty of RAM.
Plenty of disk space.
Plenty of performance.
Only Rosetta running.
Win10 and Win7 are installed.


Unless we hear from project staff what the problem is and what we can do to avoid wasting more than 22 hours of crunching 12 times over, we will stop crunching.
Maybe we are just making a "simple dumb" mistake on our side?

Looking forward to a serious reply!
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 90217 - Posted: 16 Jan 2019, 16:00:17 UTC

EXIT_ABORTED_BY_PROJECT indicates that the R@h Project Team decided to cancel work that had already been released to hosts. It relates to the batch of work, not to the host processing the work.
Rosetta Moderator: Mod.Sense
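
To tell project-side cancellations like this one apart from other failures when going through a host's results, the "Exit status" field can be decoded and grouped. This is only a rough sketch under stated assumptions: `parse_exit_status`, `split_by_cause` and `PROJECT_SIDE` are hypothetical names, the field format is taken from the single example posted in this thread, and the only status treated as project-side is the one explained above.

```python
import re

# Matches fields such as "202 (0x000000CA) EXIT_ABORTED_BY_PROJECT".
EXIT_RE = re.compile(
    r"(?P<dec>-?\d+)\s*\((?P<hex>0x[0-9a-fA-F]+)\)\s*(?P<name>.*)"
)

# Statuses that relate to the batch of work rather than to the host,
# per the explanation above. Only the one named in this thread is listed.
PROJECT_SIDE = {"EXIT_ABORTED_BY_PROJECT"}


def parse_exit_status(field):
    """Return (decimal_status, name) and check the decimal and hex forms agree."""
    m = EXIT_RE.search(field)
    if m is None:
        raise ValueError(f"unrecognised exit status: {field!r}")
    dec = int(m.group("dec"))
    hexval = int(m.group("hex"), 16)
    # The hex form is the unsigned 32-bit view of the decimal value.
    if dec & 0xFFFFFFFF != hexval:
        raise ValueError(f"decimal and hex values disagree: {field!r}")
    return dec, m.group("name").strip()


def split_by_cause(fields):
    """Split exit status fields into (project_side, other) lists."""
    project_side, other = [], []
    for field in fields:
        _, name = parse_exit_status(field)
        (project_side if name in PROJECT_SIDE else other).append(field)
    return project_side, other
```

Results that land in the first list say nothing about the health of the machine that ran them; only the second list is worth investigating on the host side.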




©2024 University of Washington
https://www.bakerlab.org