Message boards : Number crunching : PF*_aivan_* tasks on Rosetta 4.0+ - 20% failure rate uncorrected for 3 months
Author | Message |
---|---|
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
I've been reporting errors in this task in the Rosetta 4.0+ thread since October, first in a task named "BBGCBeNTF2_24_fold_SAVE_ALL_OUT_516172_1197_0" under Rosetta 4.03 but more recently in tasks with the format PF*_aivan_SAVE_ALL_OUT_* under Rosetta 4.06: "std::cerr: Exception was thrown:" It only seemed occasional so I wasn't that bothered - it happens - but a closer examination reveals it's a bit more significant. In my current task history I'm showing 111 tasks, of which 48 are Rosetta 4.06 and 63 are mini-Rosetta. 40 of the tasks haven't reported yet as they're in my queue (I complete 24 per day so my buffer is only 1.6 days). Of the completed tasks, 20% of all Rosetta 4.06 tasks are reporting "Error while computing" with this one specific error message. Apart from that, I think the only errors I get are caused by my computer crashing/locking up (I overclock and run 24/7 so the cause is more likely down to me) - maybe once or twice a month, so not significant. Can someone look into this further, seeing as I've been reporting it since October and a 20% failure rate is very high? I don't know if you're getting useful data out of it, because the tasks run the full 8 hours, but validation fails and no credits are awarded for ~56 hours' work per week. Thanks |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"I've been reporting errors in this task in the Rosetta 4.0+ thread since October, first in a task named 'BBGCBeNTF2_24_fold_SAVE_ALL_OUT_516172_1197_0' under Rosetta 4.03 but more recently in tasks with the format PF*_aivan_SAVE_ALL_OUT_* under Rosetta 4.06" Those seem to be mainly on your AMD machine. I have reported the problems I had with my Ryzen 1700 earlier, and I no longer use it on Rosetta: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874 But I have no problems on my Intel machines (i7-3770 on Ubuntu and i7-4771 on Win7 64-bit): https://boinc.bakerlab.org/show_host_detail.php?hostid=3285911 https://boinc.bakerlab.org/show_host_detail.php?hostid=3118747 I think they need to fix their AMD stuff. PS - I see a few errors on your Intel machines too that I would not expect. Are you overclocking? |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
"I've been reporting errors in this task in the Rosetta 4.0+ thread since October, first in a task named 'BBGCBeNTF2_24_fold_SAVE_ALL_OUT_516172_1197_0' under Rosetta 4.03 but more recently in tasks with the format PF*_aivan_SAVE_ALL_OUT_* under Rosetta 4.06" I noticed that, but it seems to be connected with a hardware problem with early Ryzens. My issue doesn't seem to crash out - it just gives an error message, runs to completion, then won't validate. I'm thinking it's more a coding issue with the task than with the machine. Though you are certainly right - it's only this machine that throws up the errors. Otherwise, though, it's my most reliable machine over time. "But I have no problems on my Intel machines (i7-3770 on Ubuntu and i7-4771 on Win7 64-bit):" Yup. I had just had a motherboard blow on an old Core 2 Quad and rebuilt it with an i3-8350 and both seem sweet with everything thrown at them (until I ramp the i3 up and ruin it). "PS - I see a few errors on your Intel machines too that I would not expect. Are you overclocking?" Neither (yet). The old laptop is on its last legs and suffering some heat-related issues and sound-chip weirdness. The i3 had a problem with its first few jobs because I had some corrupted downloads. Everything has been fine after the first 10 minutes. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"Those seem to be mainly on your AMD machine. I have reported the problems I had with my Ryzen 1700 earlier, and I no longer use it on Rosetta." My Ryzen 1700 is one of the "fixed" ones built after the segfault problem was solved. It works great on WCG (MCM, MIP thus far), Universe (BHspin V2), LHC/SixTrack (SSE2 and AVX), DrugDiscovery (VINA and Smina) and GPUGrid (Quantum Chemistry). So they exercise enough different parts of the chip that I know it is OK. But the problems with Rosetta aren't just with Ryzens anyway, but with all the other AMD chips that I looked at as wingmen. They all had a higher failure rate than any of the Intel chips I saw. I don't know enough to propose a fix (except maybe to recompile it), but I am sure there are plenty of people here who can suggest something. EDIT: I tried it on TN-Grid also. The FMA version of "gene@home PC-IM v1.10" is faster on the Ryzen than the AVX version on my i7-4770, with no errors. |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
"Those seem to be mainly on your AMD machine. I have reported the problems I had with my Ryzen 1700 earlier, and I no longer use it on Rosetta." Okay, but when it produces an error that says "chi angle must be between -180 and 180: nan" it still sounds more like a coding error than a processor error. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"Okay, but when it produces an error that says 'chi angle must be between -180 and 180: nan' it still sounds more like a coding error than a processor error" I suppose so. I was just responding to the view that there was something wrong with Ryzens (or AMD in general). It seems like a problem with coding to me too. But people who have tried to interact with the Rosetta developers, and who know a lot more about it than I do, have not had much luck. I hope they give some consideration to this problem. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,573,506 RAC: 7,165 |
"Okay, but when it produces an error that says 'chi angle must be between -180 and 180: nan' it still sounds more like a coding error than a processor error" I crunched a lot of 4.06 tasks on my AMD FX-6300 without problems. Is it an OS problem? Linux, Windows? |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
"Okay, but when it produces an error that says 'chi angle must be between -180 and 180: nan' it still sounds more like a coding error than a processor error" Doubt it. Windows 7 Home 64-bit. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,573,506 RAC: 7,165 |
"Doubt it. Windows 7 Home 64bit" I have Win10 (version 1709) 64-bit. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"I crunched a lot of 4.06 with my Amd Fx6300, without problems" From looking at other users with AMD machines, it seems to occur on both Windows and Linux. But surely Rosetta can look at the error rates themselves. I am a bit concerned that they have not noticed it yet, or at least not commented. Or if I am wrong, they can just say so and I will look elsewhere. But I am planning a new Ryzen+ machine later this year, and if there is no new AMD application by then, why bother trying Rosetta? There are plenty of other projects for it. |
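
The message quoted in the thread, "chi angle must be between -180 and 180: nan", is what a plain range check would print if it were handed a NaN, because every ordered comparison against NaN is false. The Python sketch below illustrates only that mechanism. The names `check_chi_angle` and `describe` are hypothetical, this is not Rosetta's code, and it says nothing about whether the NaN comes from the task's code or from the hardware.

```python
import math


def check_chi_angle(chi: float) -> float:
    """Hypothetical range check (an assumption, not Rosetta's actual code)."""
    # Any ordered comparison with NaN is False, so NaN fails this test
    # even though it is neither below -180 nor above 180.
    if not (-180.0 <= chi <= 180.0):
        raise ValueError(f"chi angle must be between -180 and 180: {chi}")
    return chi


def describe(chi: float) -> str:
    """Return 'ok' for a valid angle, otherwise the error text."""
    try:
        check_chi_angle(chi)
        return "ok"
    except ValueError as exc:
        return str(exc)


# One way a NaN can appear: an undefined operation such as inf - inf.
bad_angle = float("inf") - float("inf")

if __name__ == "__main__":
    print(describe(90.0))
    print(math.isnan(bad_angle))
    print(describe(bad_angle))
```

Once a NaN enters a calculation it propagates through later arithmetic, which would be consistent with a task that runs to completion and then fails validation, though the thread does not establish that this is what happens here.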
©2024 University of Washington
https://www.bakerlab.org