Message boards : Number crunching : Current issues with 7+ boinc client
Author | Message |
---|---|
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
First, sorry for my long hiatus. Mod.Sense recently brought this issue to our attention and I'd like to fix it as soon as possible. I have installed the latest client on a new Mac and it successfully completed a task. I'll try the other platforms. Does this issue still exist for the latest client version? Any positive input that might help us track this down is greatly appreciated. Thanks, David Kim |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Not quite sure what you're looking for here, but I've been running R@h with Boinc 7 on a couple of machines without any great problems other than less frequent checkpointing: Boinc 7.0.31 on Mac OS X 10.6.8 and 7.0.28 on W7. |
Polian Joined: 21 Sep 05 Posts: 152 Credit: 10,141,266 RAC: 0 |
Hi David, welcome back. Good to see you on here again. I personally have had no problems with BOINC 7.x on Windows or Linux machines, but: 1. I only run Rosetta. 2. On the machines that do have CUDA-capable GPUs, the GPUs are either disabled with cc_config.xml or otherwise not performing any tasks in BOINC. I believe it has been theorized on here in other threads that the troubles with BOINC 7 are related to running other projects alongside Rosetta, especially when running GPU tasks from other projects. There may be other symptoms or problems present that are not related to the above. |
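For anyone wanting to try the cc_config.xml route mentioned above, here is a minimal Python sketch that builds such a file. The post does not say which setting was used, so this assumes the client's `<no_gpus>` option inside `<options>`. The function name `build_cc_config` is made up for illustration. Where the file has to be saved (the BOINC data directory) depends on your installation and is not covered here.

```python
import xml.etree.ElementTree as ET


def build_cc_config(no_gpus: bool = True) -> str:
    """Return the text of a cc_config.xml asking the client not to use GPUs.

    Assumption: <no_gpus> is the option in use; the original post only
    says the GPUs were "disabled with cc_config.xml".
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # 1 tells the client to ignore GPUs, 0 leaves them enabled
    ET.SubElement(options, "no_gpus").text = "1" if no_gpus else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config())
```

The client normally has to re-read its config files (or be restarted) before a change like this takes effect.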
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
That's encouraging news. I'll definitely check this out by running R@h alongside a GPU project. Thanks for the info! There are a significant number of "hybridize" jobs that unfortunately cannot yet checkpoint at a finer granularity than a whole model. It will take some time to code in checkpointing for these jobs because there's a lot of information that has to be serialized, but we will be working on it. |
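To illustrate what checkpointing "below the level of a model" means in general, here is a toy Python sketch. It is not Rosetta code: the stage loop, the `state` dictionary, and the function names are all hypothetical. It only shows the usual pattern of serializing intermediate state after each stage and resuming from the last completed one.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interrupted write never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


def run_model(path, stages=5):
    """Run a toy multi-stage 'model', resuming after the last finished stage."""
    state = load_checkpoint(path) or {"next_stage": 0, "score": 0.0}
    for stage in range(state["next_stage"], stages):
        state["score"] += stage * 1.5  # stand-in for the real work of one stage
        state["next_stage"] = stage + 1
        save_checkpoint(path, state)   # checkpoint inside the model, not only at its end
    return state
```

The hard part described in the post is the contents of `state`: a real job has far more to serialize than two numbers, which is why the work takes time.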
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
For me, this machine could not get any WU validated (it finished them w/o errors, though), both when running a GPU in parallel and WITHOUT running a GPU in parallel: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1569977 I tried all the combinations possible: downgraded BOINC (to both 32- and 64-bit versions), updated the BIOS, ran rosetta@home EXCLUSIVELY (as in, no GPU project in parallel), etc. All I had left to do was reinstall the OS, but that's just ridiculous. So I was forced to abandon Rosetta (on this machine) due to this issue. Note: This machine is currently running WCG and GPUGRID at the same time with no problems. |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Quote: "For me, this machine could not get any WU validated (it finished them w/o errors, though), when running a GPU in parallel and WITHOUT running a GPU in parallel:" What BOINC client version? Do you still see this issue with the latest version? Sorry for your troubles. This is the exact issue I want to fix as soon as possible. |
Polian Joined: 21 Sep 05 Posts: 152 Credit: 10,141,266 RAC: 0 |
From the stderr output it looks like he used 7.0.28 here: https://boinc.bakerlab.org/rosetta/result.php?resultid=536434124 and downgraded to try 6.12.34 here: https://boinc.bakerlab.org/rosetta/result.php?resultid=536537603 Is the runtime preference too short for these units (3600 s and 7200 s)? |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Quote: "From stderr out it looks like he used 7.0.28 here: https://boinc.bakerlab.org/rosetta/result.php?resultid=536434124" I usually let it run for 2-3 hours. While trying to troubleshoot the source of the problem I reduced the runtime to 1 hour. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Quote: "For me, this machine could not get any WU validated (it finished them w/o errors, though), when running a GPU in parallel and WITHOUT running a GPU in parallel:" I tried almost all versions, both 32- and 64-bit. I even ran a single WU on this 8-threaded CPU, thinking that running 8 WUs at the same time could be the source of the problem. (Hint: it's not.) My other machines are running BOINC version 7.X and some are even crunching with the GPU as well (like Collatz and Moo!) and have no problems. The only differences between those and this machine are the CPU (Ivy Bridge), the RAM (some PC3-12800), and the NVIDIA GPU (GTX 660M). Edit: BTW, from reading all the errors people are getting, I think the source of this issue is more of a hardware "incompatibility" problem than a pure software problem. I, for instance, have multiple machines with no problem but one with the problem; the only differences are the OS (one has Win 7 Ultimate, the other Win 7 Home Premium) and the hardware. It's a really weird bug. |
Daedalus Joined: 1 Aug 08 Posts: 39 Credit: 10,106,899 RAC: 508 |
Still the same problem: one WU, one error. The error is not visible in the client:
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<stderr_txt>
[2012-10-18 19:30: 7:] :: BOINC:: Initializing ... ok.
[2012-10-18 19:30: 7:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev50262.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/2012_10_9_mini_y001_folding.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
Starting work on structure: _00002
Starting work on structure: _00003
Starting work on structure: _00004
Starting work on structure: _00005
Starting work on structure: _00006
Starting work on structure: _00007
Starting work on structure: _00008
Starting work on structure: _00009
Starting work on structure: _00010
Starting work on structure: _00011
Starting work on structure: _00012
Starting work on structure: _00013
Starting work on structure: _00014
Starting work on structure: _00015
Starting work on structure: _00016
Starting work on structure: _00017
Starting work on structure: _00018
======================================================
DONE :: 1 starting structures 10282.1 cpu seconds
This process generated 18 decoys from 18 attempts
======================================================
BOINC :: WS_max 1.51771e+82
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
</stderr_txt>
]]> |
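When comparing many results like the one above, it can help to pull the key fields out of the stderr text automatically. The following Python sketch does that with regular expressions. The patterns are based only on the wording visible in this one log, and the function name `summarize_stderr` is made up, so treat it as a starting point rather than a parser for every Rosetta output.

```python
import re


def summarize_stderr(text):
    """Extract client version, CPU time and decoy counts from a stderr dump.

    The patterns follow the wording of the log posted above; other
    application versions may word these lines differently.
    """
    version = re.search(r"<core_client_version>([^<]+)</core_client_version>", text)
    cpu = re.search(r"([\d.]+) cpu seconds", text)
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", text)
    return {
        "client_version": version.group(1) if version else None,
        "cpu_seconds": float(cpu.group(1)) if cpu else None,
        "decoys": int(decoys.group(1)) if decoys else None,
        "attempts": int(decoys.group(2)) if decoys else None,
        # True when the application reported a clean shutdown
        "called_boinc_finish": "called boinc_finish" in text,
    }
```

A result that reports a full set of decoys and a clean `boinc_finish` but is still marked invalid is the pattern several posters in this thread describe.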
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,992,337 RAC: 15,074 |
Quote: "For me, this machine could not get any WU validated (it finished them w/o errors, though), when running a GPU in parallel and WITHOUT running a GPU in parallel:" In our team (TSC! Russia) we now have 5 (five) computers (from different owners/members) with the same symptoms. All of them have now been switched to other projects (which work successfully and without errors), as they cannot run R@H at all: calculations run without any errors (in the local BOINC client or in the logs), but after passing through the validator 100% of the WUs are marked as invalid. One of these computers was attached to R@H for a short time to check whether the errors continue. They do; here are 2 bad WUs as an example (after which the computer was again switched to other projects): https://boinc.bakerlab.org/rosetta/results.php?hostid=1555324 |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Thanks for all the info. It definitely helps. Hopefully I'll have some time to look into this further next week. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Two win7 machines, both at BOINC 7.0.36. One has problems consistently, the other works fine, consistently. See thread for details. Rosetta Moderator: Mod.Sense |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Like others, I'm seeing messages in the event log (Mac OS X 10.6.8/Boinc 7.0.31) reporting this error: exited with zero status but no 'finished' file
Sample output:
Sat Nov 3 23:17:29 2012 | rosetta@home | Scheduler request completed: got 0 new tasks
Sat Nov 3 23:19:04 2012 | rosetta@home | Finished download of input_hyb_al_02_bench_3slkB_yfsong.zip
Sat Nov 3 23:19:47 2012 | | Suspending network activity - user request
Sun Nov 4 02:53:03 2012 | rosetta@home | Computation for task Ploop4_2_abinitio_design_y465_009_60334_1680_0 finished
Sun Nov 4 02:53:20 2012 | rosetta@home | Starting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_05_62798_11_0 using minirosetta version 341 in slot 1
Sun Nov 4 02:55:19 2012 | rosetta@home | Task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 exited with zero status but no 'finished' file
Sun Nov 4 02:55:19 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.
Sun Nov 4 02:55:19 2012 | rosetta@home | Restarting task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 using minirosetta version 341 in slot 0
Sun Nov 4 08:36:36 2012 | rosetta@home | Computation for task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_05_62798_11_0 finished
Sun Nov 4 08:36:48 2012 | rosetta@home | Starting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_10_03_62798_11_0 using minirosetta version 341 in slot 1
Sun Nov 4 08:40:53 2012 | rosetta@home | Task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 exited with zero status but no 'finished' file
Sun Nov 4 08:40:53 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.
Sun Nov 4 08:40:53 2012 | rosetta@home | Restarting task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 using minirosetta version 341 in slot 0
Sun Nov 4 08:42:18 2012 | rosetta@home | work fetch suspended by user
Sun Nov 4 08:42:56 2012 | rosetta@home | task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 aborted by user
Sun Nov 4 08:42:57 2012 | rosetta@home | Starting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_07_62798_7_0 using minirosetta version 341 in slot 2
Sun Nov 4 08:43:39 2012 | rosetta@home | Computation for task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 finished
Sun Nov 4 08:44:15 2012 | rosetta@home | Task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_07_62798_7_0 exited with zero status but no 'finished' file
Sun Nov 4 08:44:15 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.
Sun Nov 4 08:44:15 2012 | rosetta@home | Restarting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_07_62798_7_0 using minirosetta version 341 in slot 2
Sun Nov 4 08:44:17 2012 | rosetta@home | Task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_10_03_62798_11_0 exited with zero status but no 'finished' file
Sun Nov 4 08:44:17 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.
Sun Nov 4 08:44:17 2012 | rosetta@home | Restarting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_10_03_62798_11_0 using minirosetta version 341 in slot 1
Sun Nov 4 08:44:20 2012 | | Resuming network activity
Sun Nov 4 08:44:20 2012 | rosetta@home | Started upload of Ploop4_2_abinitio_design_y465_009_60334_1680_0_0
Sun Nov 4 08:44:20 2012 | rosetta@home | Started upload of rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_05_62798_11_0_0
Sun Nov 4 08:44:25 2012 | rosetta@home | Finished upload of Ploop4_2_abinitio_design_y465_009_60334_1680_0_0
Sun Nov 4 08:44:27 2012 | rosetta@home | Finished upload of rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_05_62798_11_0_0
Sun Nov 4 08:44:31 2012 | rosetta@home | Sending scheduler request: To report completed tasks.
Sun Nov 4 08:44:31 2012 | rosetta@home | Reporting 3 completed tasks
Sun Nov 4 08:44:31 2012 | rosetta@home | Not requesting tasks: scheduler RPC backoff
Sun Nov 4 08:44:35 2012 | rosetta@home | Scheduler request completed |
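To see how often each task hits the "exited with zero status but no 'finished' file" message, the event log can be scanned with a short script. This Python sketch assumes the log has been copied into a text string in the same format as the sample above. The function name `count_unfinished_exits` is hypothetical.

```python
import re
from collections import Counter

# Matches the client message quoted in the post and captures the task name
UNFINISHED = re.compile(r"Task (\S+) exited with zero status but no 'finished' file")


def count_unfinished_exits(log_text):
    """Count, per task name, how many times the 'no finished file' message appears."""
    return Counter(UNFINISHED.findall(log_text))
```

Calling `count_unfinished_exits(log_text).most_common()` lists the tasks that restarted most often first, which makes it easier to tell a one-off restart (for example after sleep or hibernate) from a task that keeps failing.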
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I get the same with Windows and BOINC 6, so I don't think it is a part of what this thread was created for. So, please open a new thread if you like, to keep the two concepts separated. I get that reported error when I shutdown my laptop with sleep or hibernate. Rosetta Moderator: Mod.Sense |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Quote: "I get the same with Windows and BOINC 6, so I don't think it is a part of what this thread was created for. So, please open a new thread if you like, to keep the two concepts separated. I get that reported error when I shutdown my laptop with sleep or hibernate." Well, the problems thread, where I agree this post really belongs, is getting a bit unwieldy with 400+ entries. I will start a new thread on this, though. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
WR-HW95 Joined: 5 Jan 06 Posts: 2 Credit: 8,086,818 RAC: 0 |
My other machine (Win XP, Phenom 965, GTX 275 SLI) works fine when running S@H and R@h at the same time using Boinc 6, but this machine (Win 7 Ultimate, Phenom 1090, GTX 470 + GTX 660 Ti) fails validation on every R@H task using Boinc 7.0.28. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Another host resulting in nothing but client errors (from the validator, the WUs finish with no problems): https://boinc.bakerlab.org/rosetta/results.php?hostid=1577411 |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,992,337 RAC: 15,074 |
To David E K: There is another thing I wanted to draw your attention to. One thing that all computers with this bug (100% error rate at the validation stage) have in common is that the minirosetta version is missing from the logs. For example: https://boinc.bakerlab.org/rosetta/result.php?resultid=543001353 (Validate state: Invalid). Could this missing version information be the reason that the validator marks all such WUs as invalid, even though they were actually calculated correctly? |
©2024 University of Washington
https://www.bakerlab.org