Message boards : Number crunching : Client Errors
Author | Message |
---|---|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, server configuration may play a role and should be reviewed. Ralph may also be sending a different type of work unit from the ones that fail on Rosetta. One reason Ralph may send work even when the version is already released on Rosetta is that there are new types of work units being tested. So, perhaps something within the work units is now improved that causes them to work. Or, perhaps the type of work differs from what fails on your environment. Rosetta Moderator: Mod.Sense |
[AF>Le_Pommier] Jerome_C2005 Send message Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
Hi people, I don't have time to read everything that's been written since my previous post, but I just wanted to let you know that after installing the new stable BOINC 7.0.25 on my iMac I resumed Rosetta to give it a new try, and WUs are running fine now. They still credit some very small amounts, but it works fine ;) (and I don't do it for credits of course :) ) Well, I realize that the other piece of information is that I upgraded from Snow Leopard to Lion just after I posted (on the 24/03), so there may be a link also... |
Rocco Moretti Send message Joined: 18 May 10 Posts: 66 Credit: 585,745 RAC: 0 |
Well. Ralph ran 9 WUs (so far) to completion. Successfully. ?!?!? Did not expect that. Ralph and Rosetta@home are intended to be basically the same. Ralph just gets applications and new jobs slightly before Rosetta does, so we can hopefully avoid pushing bad jobs/applications to Rosetta@home. The whole point of Ralph is that things which would give errors on Rosetta@home would show errors on Ralph first. Maybe it's time to look at the Rosetta server software? From what I understand, the Ralph@home and Rosetta@home server back-ends are running the same version of the software (there are differences in the versions of the web page software, but that shouldn't affect the result reporting). That doesn't mean there isn't some slight change in configuration which could be causing this interaction. We'll take a look at the servers and see if we can figure out what the difference is. |
wbblakemore Send message Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
Sounds like the most promising lead we've had so far ... |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The big picture sounds like results files are being corrupted in potentially trivial ways between the client and the validation process. Would it be possible to have the wrap-up processing in the Rosetta client calculate a checksum or MD5 and store it in the result file? That way if the file does not validate properly, the checksum of the current, server copy of the file (except of course for the added checksum itself) can be compared against the stored checksum to confirm the data. That would essentially prove if some change to the file occurs. From that point it is a matter of tracking down WHERE that change occurs. Rosetta Moderator: Mod.Sense |
Snags Send message Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
|
wbblakemore Send message Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
Is this the same problem they were having at Einstein@home? Described here No, it's a different one. Ours involves the WU running to completion, only to fail on a validation error from the server. It's only been observed on machines which are also running GPU applications from other projects, using an NVIDIA graphics card. |
Sky King Send message Joined: 28 Feb 12 Posts: 11 Credit: 15,912 RAC: 0 |
Is this the same problem they were having at Einstein@home? It's only been observed on machines which are also running GPU applications from other projects, using a NVidia graphics card. I do want to clarify one thing... a previous post said that the error only occurs when GPU processing is enabled, and the quoted post implies that it only happens if you are using your GPU for other projects. I don't believe either of those is true. Like most who are having this problem, I have an i7, 64-bit Windows 7, and an EVGA-branded NVIDIA 560... However, neither the 560 nor the new instance of Windows 7 on my machine has been used for any projects, not even F@H or other non-BOINC ones, since installation. I believe the determinant is simply whether or not you have the NVIDIA driver installed... Whether you are using it for anything more than just a display adapter appears to be immaterial. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,187,038 RAC: 3,476 |
Is this the same problem they were having at Einstein@home? Described here It is ALSO happening with AMD cards; that is what I have, and every machine I have a card in gives errors. The ONE machine without a crunching GPU in it returned units just fine. |
In Memory of Kimsey M Fowler Sr Send message Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
I agree with this statement. I cannot speak for the AMDs. Sorry for the delay with the test I proposed several days ago, but it's tax time in the US. The test runs are nearly complete, and I hope to analyze the results this weekend. |
Rocco Moretti Send message Joined: 18 May 10 Posts: 66 Credit: 585,745 RAC: 0 |
Is this the same problem they were having at Einstein@home? As wbblakemore says, the symptoms are different, but there might be some commonalities. If the "client writes trash to important files" bit is happening to the results, the Rosetta@home server (but apparently not the Ralph@home server) might be choking on that trash, leading to the result being tagged as a compute error. The only issue is that people running the 6.10.58 (as well as 7.0.20) have reported the error, and the examples I received from Kimsey Fowler didn't show any signs of corruption. - It's probably unrelated, though I've been surprised before. BTW, I've taken a look at the difference between the Ralph and Rosetta@home servers, and can't see anything which would obviously cause a difference, but there is a slight compiler setting difference, so I'm looking into whether that might have some bearing on the issue. It is ALSO happening when using AMD cards too, that is what I have and every machine I have a card in gives errors Was that the same error issue? (All workunits consistently get listed as a Client Error outcome and have a missing application version, but according to stderr out exit successfully and have an exit status of 0). If I remember correctly, you were having a slightly different issue, primarily with the CASP9/hybridize workunits. (For what it's worth, to the best of my knowledge the bug which caused those errors has been fixed with the 3.26 release.) |
wbblakemore Send message Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
If the "client writes trash to important files" bit is happening to the results, the Rosetta@home server (but apparently not the Ralph@home server) might be choking on that trash, leading to the result being tagged as a compute error. The only issue is that people running the 6.10.58 (as well as 7.0.20) have reported the error, and the examples I received from Kimsey Fowler didn't show any signs of corruption. - It's probably unrelated, though I've been surprised before. I'm sorry, but I keep coming back to one basic thought -- if the CLIENT is somehow corrupting data files (and the client software is presumably identical across servers, including compiler options), why is one server processing the data files correctly and the other server isn't? I'd want to be looking at the server environment, especially any dynamic link libraries, to compare versions and possible changes -- and if the servers are Windows based, whether they are both running with the same update for .NET. Microsoft is infamous for breaking things with updates. As an alternative, is there a possibility that the client compiled on different servers is somehow being linked with different versions of system routines and/or toolboxes? |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,187,038 RAC: 3,476 |
Is this the same problem they were having at Einstein@home? I have not crunched here for a couple of weeks but I think ver 3.26 was what I was using when all but one of my machines had their problems! It should be in my stats, to me they have been archived but I am sure you can manually check them. |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,677,569 RAC: 10,479 |
Is this the same problem they were having at Einstein@home? 3.26 was released last week (on the 5th according to the 3.26 thread) so that might have fixed your problems ;) |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,187,038 RAC: 3,476 |
I have not crunched here for a couple of weeks but I think ver 3.26 was what I was using when all but one of my machines had their problems! It should be in my stats, to me they have been archived but I am sure you can manually check them. I will try it again then as soon as I reach my goal on another project I am currently working on. |
AlphaLaser Send message Joined: 19 Aug 06 Posts: 52 Credit: 3,327,939 RAC: 0 |
Just got a batch of Ralph WUs and I can confirm that my host has been able to successfully complete Ralph while failing here at Rosetta. I set the runtimes to be 1 hr on both projects. Host at Rosetta: https://boinc.bakerlab.org/results.php?hostid=1455479 Host at Ralph: http://ralph.bakerlab.org/results.php?hostid=27840 |
In Memory of Kimsey M Fowler Sr Send message Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Here are the proposed test steps: The test to run a WU to both a failed and a successful state of completion is done. I generally followed the procedure, but with modifications suggested by Mod.Sense (thank you for the very useful info). The primary WU data files that are generated by my computer look about the same except for some "time description" differences (delta times, not clock times). I don't know if that is significant. The file named BOINC/client_state.xml had generally the same information for both the failed and successful runs. One difference was that the file size and MD5 were different, but only due to the difference in the time description values. I looked at other file types in the collection, but couldn't find anything interesting. Rocco's earlier findings from looking at my saved BOINC directory didn't pick up any irregularities either. Most of the run details you see reported on the web for each work unit are being extracted from the client_state.xml file. There's only one of these files in the BOINC directory, and it contains details for all of the different WUs you are running or have waiting in the queue. The raw data file for each WU lives in the directory BOINC\projects\boinc.bakerlab.org_rosetta and is named with the long (alphanumeric) name of the WU. It has no file extension, so you need to add ".GZ" if you want to unzip it to see the data. The file is usually deleted once it has uploaded, soon after completion of a WU. The point of this is that what you see on the WU web page primarily comes from the client_state.xml file and not from the raw data file. This raises the questions: 1) is the raw data file getting uploaded at all? 2) is it being modified/damaged during transfer? 3) will failing the MD5 check cause client errors? 4) what server-side software writes the phrase "client error" and what triggers that error? 5) where does the WU web page get the Rosetta version number? 
If you are interested in looking at the details of the analysis, the pertinent information for one WU I looked at can be found in this zip file. If you want additional data files from the BOINC directory, please let me know. ---KMF, Jr. |
In Memory of Kimsey M Fowler Sr Send message Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Just got a batch of Ralph WUs and I can confirm that my host has been able to successfully complete Ralph while failing here at Rosetta. I set the runtimes to be 1 hr on both projects. I installed Ralph@Home several days ago, but it won't give me any tasks. I've tried the PROJECT, UPDATE button many times. Any suggestions? ---KMF, Jr. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I installed Ralph@Home several days ago, but it won't give me any tasks. I've tried the PROJECT, UPDATE button many times. Any suggestions? ---KMF, Jr. Right, Ralph only issues work periodically when a new version or new work units need testing. Just have to wait for some to be available. BOINC retries periodically for you. Rosetta Moderator: Mod.Sense |
wbblakemore Send message Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
I just wanted to tell you how much your hard work is appreciated by all of the rest of us who are suffering from this bug. It's really above and beyond the call of duty, and I take my hat off to you for doing it. |
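
For anyone who wants to try the checksum comparison Mod.Sense suggested earlier in the thread, here is a minimal Python sketch. It is not how the Rosetta client or server actually works: the function names are made up for illustration, and it assumes you record the MD5 next to the result file on the client side (rather than inside the file, as Mod.Sense proposed) and later compare it against a copy of the same file taken after upload. It also includes a helper to read the extensionless raw result file, which per KMF's post is gzip-compressed.

```python
import gzip
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_md5(path, recorded_md5):
    """True if the file still hashes to the digest recorded earlier.

    Hypothetical helper: recorded_md5 is whatever was saved on the client
    before upload; surrounding whitespace and letter case are ignored.
    """
    return file_md5(path) == recorded_md5.strip().lower()


def read_gzipped_result(path):
    """Decompress a gzip-compressed file regardless of its extension."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

If the digest recorded on the client matches the digest of the copy the validator sees, the file was not altered in transit, and the search can move on to how the server interprets it. If they differ, the next step is to compare the two copies to see where the change happened.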
©2024 University of Washington
https://www.bakerlab.org