Message boards : Number crunching : Client Errors
Author | Message |
---|---|
A.M. Joined: 13 Jun 06 Posts: 12 Credit: 954,586 RAC: 0 |
If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying. Actually, going to try a WU from the new Rosetta 3.26 first. Then we'll see about Ralph. |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying. I was going to try Ralph@Home, but when I went to add it as a BOINC project, it wasn't in the list. Is there a trick to finding it, or has it been temporarily removed because Rosetta 3.26 was just released and there's nothing to test? |
AlphaLaser Joined: 19 Aug 06 Posts: 52 Credit: 3,327,939 RAC: 0 |
If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying. You can attach to it manually by clicking Attach to project and entering the URL in the box: http://ralph.bakerlab.org/ However, they do not have very much work right now. |
A.M. Joined: 13 Jun 06 Posts: 12 Credit: 954,586 RAC: 0 |
If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying. Sigh. Same crap as before. https://boinc.bakerlab.org/rosetta/result.php?resultid=497264507 |
AlphaLaser Joined: 19 Aug 06 Posts: 52 Credit: 3,327,939 RAC: 0 |
A) Anyone using the HP S1931, 2031, 2231, 2331 monitor series? My affected host is a laptop (Dell XPS). Every once in a while I connect a second monitor (actually a Samsung TV) via HDMI, but I'm pretty sure I get errors without it connected.
Nope, only one discrete GPU inside the laptop, a Geforce 435M. |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
My full rig description from the factory invoice is as follows: INTEL, Core™ i7-2700K Quad-Core 3.5 - 3.9GHz TB, HD Graphics 3000, LGA1155, 8MB L3 Cache, 32nm, 95W, EM64T EIST HT VT-x XD, Retail INNOVATION COOLING, Diamond 7 Carat Thermal Compound, Electrically Non-Conductive CORSAIR, H100 Hydro CPU Liquid Cooling System, Socket LGA2011/1155/1156/1366/775/AM3/AM2, Retail ASUS, Maximus IV Extreme-Z, LGA1155, Intel® Z68, DDR3-2200 (O.C.) 32GB /4, PCIe x16 SLI CF /1+2*, SATA 3Gb/s RAID 5 /4, 6Gb/s /4, USB 3.0 /8, HDA, BT, GbLAN /2, FW /2, ATX, Retail G.SKILL, 16GB (4 x 4GB) Ripjaws PC3-10600 DDR3 1333MHz CL7 (7-7-7-21) 1.5V SDRAM DIMM, Non-ECC EVGA, GeForce® GTX 560 Ti 822MHz, 2GB GDDR5 4000MHz, PCIe x16 SLI, 2x DVI + mini-HDMI, Retail CREATIVE, Sound Blaster® X-Fi Titanium Fatal1ty Champion, 7.1 channels, 24-bit 192KHz, I/O Module, PCIe x1 WESTERN DIGITAL, 2TB WD Caviar® Black™ (WD2002FAEX), SATA 6 Gb/s, 7200 RPM, 64MB Cache CRUCIAL, 64GB M4 SSD, MLC Marvell 88SS9174, 500/95 MB/s, 2.5-Inch, SATA 6 Gb/s, Retail CRUCIAL, 64GB M4 SSD, MLC Marvell 88SS9174, 500/95 MB/s, 2.5-Inch, SATA 6 Gb/s, Retail RAID, No RAID, Independent HDD Drives PLEXTOR, PX-B320SA Black 8x/16x/48x BD/DVD/CD, Blu-ray Disk™ Combo Drive, SATA, Retail PLEXTOR, PX-LB950SA 12x/16x/48x BD/DVD/CD Blu-ray Disc™ Burner w/ Lightscribe, SATA, Retail COOLER MASTER, HAF X (RC-942-KKN1) Black Tower Case w/ Window, EATX, 9 Slots, No PSU, Steel/Plastic CUSTOM WIRING, Standard Wiring with Round Cables CORSAIR, CMPSU-1200AX Gold AX1200 Power Supply w/ Modular Cables, 1200W, 80 PLUS®, 24-pin ATX12V v2.31 EPS12V 2.92, Multi-GPU Ready MICROSOFT, Windows 7 Professional 64-bit Edition w/ SP1, OEM A. Nope, I'm using a LG flat panel, which W7 called a "generic PnP monitor" B. Yes, I'm using a true DVI connection to my monitor C. No, I have a single EVGA NVidia 560 ti graphics card installed. D. n/a E. n/a |
AlphaLaser Joined: 19 Aug 06 Posts: 52 Credit: 3,327,939 RAC: 0 |
If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying. I just got the same: https://boinc.bakerlab.org/result.php?resultid=497320429 |
Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0 |
I have received one WU at ralph and crunched it successfully!! http://ralph.bakerlab.org/result.php?resultid=2647221 Regards, Rayburner |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Is there anyone with our problem that does not have all of the following attributes: 1) one or more NVIDIA GPU's, & 2) running Win7 64-bit, & 3) Intel I7 processor ? |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Some additional tests have shown that when my machine was generating good work units with "only one NVIDIA installed", only a subset of the NVIDIA drivers were really installed. Noteworthy among the missing was the CUDA driver. I'm not saying this driver is at the root of the problem, only that the full complement of drivers was not present for successful WU's. |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
I have received one WU at ralph and crunched it successfully!! Hi Rayburner: Thank you for some positive sounding news! I wonder if you have fiddled with the drivers or anything else since you last had a WU ending in client error? Perhaps you could give Rosetta another try and see what happens. Thus far I have not tried Ralph, but will give it a shot. Kimsey, Jr. |
Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0 |
I have received one WU at ralph and crunched it successfully!! Nothing was changed on my side. After the successful run on ralph I tried rosetta again. Unfortunately, it still ended with the known client error. That leads me to assume there must be a difference between ralph and rosetta. Maybe the project admins can have a look at possible differences on the server side. Regards, Rayburner |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Questions for Rocco or other Rosetta Staff Members: I am planning a new test for which I would appreciate any tips that might be helpful for planning and execution. First of all, I think we have all been assuming there is a problem with the client-side output from Rosetta or BOINC. What if, for example, one of the output files that's uploaded to the server at the completion of a WU has a slightly different output format, say an unexpected space or extra character caused by an NVIDIA driver? When the uploaded files are processed by the server, a read statement in the processing software could choke on the extra character, and we would never see the Rosetta version number for the WU in "Task Details". Am I correct that the server uses our uploaded files as an input to generate the "Task Details" HTML files that we view on the web for each WU? If so, do we know whether the "Exit status = 0 (0x0)" line is already in the preliminary version of the file by default at the end of a failed WU, or did our "client error" WU's really complete successfully, with part of the server-side software failing to generate the HTML output? Enough rambling about my theories. The goal of the test is to run the same WU twice, first with the machine configured for success, and a second time with the machine in a bad state. Here are the proposed test steps: 1) Determine from Rosetta staff exactly which files are uploaded to the server following execution of a WU and where exactly they reside while waiting to upload. 2) Configure my problem machine so that it will successfully complete WU's without client error. 3) Pick a particular work unit and when it is at about the 50% completion mark, terminate BOINC/Rosetta to force it to be saved at a breakpoint. 4) Copy and save the entire /ProgramData/BOINC directory. 5) Unplug the network cable, restart BOINC/Rosetta, and let the chosen WU complete successfully. 6) Terminate BOINC/Rosetta, again forcing existing WU's into a second breakpoint condition. 
7) Copy & save the output files for the chosen WU (note that it cannot upload because the network cable is unplugged). 8) Configure the machine so that subsequent WU's will end with client errors. 9) Restore the entire BOINC directory to the state it was in at the first breakpoint. 10) Let the same WU chosen earlier complete again, but with client errors. 11) Collect the same set of output files before they are uploaded to the server. 12) Use file comparison software to identify differences. Once the differences are identified, it should be much easier to find a solution. Comments/suggestions please? KMF, Jr. |
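The final comparison step of the plan above could be sketched as follows. This is only an illustration in Python, not a tool from the Rosetta or BOINC projects: `first_difference` and `compare_files` are hypothetical helper names, and the sketch assumes the two saved copies of an output file can simply be compared byte by byte. It reports the offset of the first differing byte with a little context on each side, which is the kind of "unexpected space or extra character" difference the theory is looking for.

```python
from pathlib import Path


def first_difference(good: bytes, bad: bytes, context: int = 16):
    """Return (offset, good_snippet, bad_snippet) for the first differing
    byte, or None when the two byte strings are identical."""
    limit = min(len(good), len(bad))
    for offset in range(limit):
        if good[offset] != bad[offset]:
            break
    else:
        # No mismatch inside the common length.
        if len(good) == len(bad):
            return None
        offset = limit  # one input is a strict prefix of the other
    start = max(0, offset - context)
    end = offset + context
    return offset, good[start:end], bad[start:end]


def compare_files(good_path, bad_path):
    """Compare two saved output files (hypothetical paths) byte by byte."""
    good = Path(good_path).read_bytes()
    bad = Path(bad_path).read_bytes()
    return first_difference(good, bad)
```

A byte-level comparison is used rather than a line-based diff so that it still works if the output file turns out to be binary or compressed.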
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I like the sound of your plan. You have described just about the only way to ensure you actually run exactly the same task, version, host ID, and random number seed (they are embedded with the task) more than once. My only suggestion would be that if you are going to go to all of the trouble, go ahead and do what you've outlined with several tasks at the same time. That might mean you would want to increase the number of days between network connections configured in your BOINC network preferences. You can go ahead and back up BOINC either before tasks start, or, as you described, at a checkpoint. Either way, when you restore, the status of the task will be as it was when you did the backup. You can see what the output file name (just one for Rosetta, I believe) for a task will be by reviewing the client_state.xml file in your BOINC data directory (which is shown at the top of the event log as BOINC starts up). The file name is described in two parts: you will see a task which identifies a "<result_name>", then you will see that given result name identified later with "<file_info>". The files with those names should be under your BOINC data directory when the task completes. I don't recall if they remain in the slots directory of the task they were produced by, or if they all get moved to a higher-level directory. I think they do occupy one of the slots directories until they are uploaded and confirmed via update project (which BOINC does for you periodically, or you can do manually from the projects tab). Rosetta Moderator: Mod.Sense |
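The client_state.xml lookup described above could be sketched like this in Python. The element layout here is an assumption based only on the description in the post: the sketch assumes `<result_name>` elements somewhere in the file, `<file_info>` elements with a `<name>` child, and output file names that begin with the result name. `SAMPLE` is invented data for illustration, not a real client_state.xml, and `output_files_by_result` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented, heavily simplified stand-in for client_state.xml (assumption).
SAMPLE = """<client_state>
  <file_info><name>task_A_0</name></file_info>
  <file_info><name>rosetta_app.exe</name></file_info>
  <active_task><result_name>task_A</result_name></active_task>
</client_state>"""


def output_files_by_result(xml_text: str) -> Dict[str, List[str]]:
    """Map each <result_name> to the <file_info> names that start with it."""
    root = ET.fromstring(xml_text)
    result_names = {
        el.text.strip() for el in root.iter("result_name") if el.text
    }
    file_names = [
        el.findtext("name", default="").strip()
        for el in root.iter("file_info")
    ]
    # Loose prefix match: an assumption about how output files are named.
    return {
        result: sorted(n for n in file_names if n.startswith(result))
        for result in sorted(result_names)
    }


print(output_files_by_result(SAMPLE))
```

To try it on a real machine, read the file with `Path(...).read_text()` and pass the text in; if the real layout differs from the assumed one, the tag names in the two `iter` calls are the places to adjust.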
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Further, I'd suggest taking a backup of the entire BOINC data directory at the point where the tasks have completed crunching. That way you could actually compare everything else as well, not just the output file. Some differences would be expected: wherever the event log is stored, for example, all of the timestamps on messages would be different. But perhaps you will turn up a difference in a configuration file of some kind, or in something that reflects the detected hardware on the machine. Rosetta Moderator: Mod.Sense |
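A whole-directory comparison of the two backups could be sketched as below, again in Python and only as an illustration. `snapshot` and `compare_snapshots` are hypothetical helpers: the first hashes every file under a backup directory, the second lists files present in only one backup and files whose contents differ. Files that are expected to differ, such as logs with timestamps, will show up in the `changed` list and have to be filtered by hand.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_snapshots(good, bad):
    """Return (only_in_good, only_in_bad, changed) as sorted path lists."""
    only_good = sorted(good.keys() - bad.keys())
    only_bad = sorted(bad.keys() - good.keys())
    changed = sorted(k for k in good.keys() & bad.keys() if good[k] != bad[k])
    return only_good, only_bad, changed
```

Usage would be along the lines of `compare_snapshots(snapshot(good_dir), snapshot(bad_dir))`, where `good_dir` and `bad_dir` are the two saved copies of the BOINC data directory.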
A.M. Joined: 13 Jun 06 Posts: 12 Credit: 954,586 RAC: 0 |
Well. Ralph ran 9 WUs (so far) to completion. Successfully. One thing I did notice, and am looking into now, is that on Ralph, I'm using the default run-time target, whereas I was using different times on Rosetta. In the interest of eliminating another variable, I'm attempting another Rosetta task with the run-time set to the default. |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
Well. Ralph ran 9 WUs (so far) to completion. Successfully. Just to make sure I'm understanding this correctly ... the same version of the client software (3.26?) is currently being run on both Rosetta and Ralph. On Rosetta, WU's error out if GPU processing is enabled, but on Ralph, they don't. If I've correctly stated the situation, then it sounds like the problem isn't with either the Rosetta client or the NVidia drivers, it's with the Rosetta server. Maybe it's time to look at the Rosetta server software? |
A.M. Joined: 13 Jun 06 Posts: 12 Credit: 954,586 RAC: 0 |
Well. Ralph ran 9 WUs (so far) to completion. Successfully. That does seem to be the essence of the situation, yes. I have now successfully completed 18 WUs from Ralph, while 3 more from Rosetta have failed. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Well. Ralph ran 9 WUs (so far) to completion. Successfully. Checked your one computer with 4 tasks completed. It's weird, you have this status at the end: DONE :: 2 starting structures 7788.38 cpu seconds This process generated 2 decoys from 2 attempts Which is good. But why they are marked invalid is weird. They ran OK. They shut down at the end of your time limit. What is also weird is that a Mac completes the same task OK, but your Win7 machine does not, at least according to Rosetta's computer system. |
A.M. Joined: 13 Jun 06 Posts: 12 Credit: 954,586 RAC: 0 |
Checked your one computer with 4 tasks completed. Yes, weird. We all agree. |
©2024 University of Washington
https://www.bakerlab.org