Message boards : Number crunching : Work being reset to 0% on restart after BOINC is shut-down
Author | Message |
---|---|
Tim Send message Joined: 5 Mar 12 Posts: 2 Credit: 3,988 RAC: 0 |
My last two tasks from Rosetta appear to have a problem where they reset to 0% complete on restart after BOINC has been shut down. I have tried suspending the task before shutdown as well, but it still gets reset. I have had this happen before. Ultimately I abort the task to get new work, as eventually the task misses its deadline. Has anyone else had this problem? I am running BOINC 7.0.28 on Windows XP SP3. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,157,235 RAC: 4,004 |
My last two tasks from Rosetta appear to have a problem where they reset to 0% complete on restart after BOINC has been shut down. I have tried suspending the task before shutdown as well, but it still gets reset. How long is it running before you shut it down? It SHOULD be checkpointing the task at intervals as it crunches and then restarting from that point when you restart the PC. Have you tried 'hibernating' the PC instead of actually shutting it down? If you are just traveling from place to place that will certainly make it restart faster, AND may keep BOINC from actually shutting down. |
Polian Send message Joined: 21 Sep 05 Posts: 152 Credit: 10,141,266 RAC: 0 |
Hybrid jobs don't yet checkpoint more often than once per full model run, see here. I see that all your recent tasks start with hyb and/or hybrid in the workunit names. Since your computer doesn't complete a full model run before it's shut down/rebooted or BOINC is otherwise exited, the task starts over from the beginning. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Some tasks have very long running models and no checkpoints within the models. Such tasks would produce the symptoms you describe if your PC is not running BOINC for more than a few hours at a time. Or if the machine is so busy that the active tasks don't get much CPU before the machine is powered off again. Rosetta is set up in a way that if a task begins at the same starting point more than 5 times, this is detected and it is ended for you. Rosetta Moderator: Mod.Sense |
Tim Send message Joined: 5 Mar 12 Posts: 2 Credit: 3,988 RAC: 0 |
Some tasks have very long running models and no checkpoints within the models. Such tasks would produce the symptoms you describe if your PC is not running BOINC for more than a few hours at a time. Or if the machine is so busy that the active tasks don't get much CPU before the machine is powered off again. ================================================ Thanks for your reply. I can sometimes only run BOINC for a few hours. Some Rosetta tasks obviously do checkpoint, as they can restart from an intermediate point, while others do not. I now usually only give the tasks two chances before aborting, as it is clearly a waste of time if the task gets reset. One of the respondents has pointed out that the problem lies in Hybrid tasks. I don't understand why checkpointing cannot be set up in all tasks. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I don't understand why checkpointing cannot be set up in all tasks. It's not that checkpoints cannot be set up, simply that they have not been set up (for that type of work unit). To do so requires a considerable coding effort. All work units checkpoint at the end of each model. But if the task's first model happens to run for a long time, or perhaps all models for a given technique run longer than you tend to have BOINC active, then the task can be ended before it has reached a checkpoint. That is why the safeguard is in place to detect that this is occurring repeatedly and end the tasks for you. As new techniques are being developed, checkpoints are not the first thing you work on. First you find out if the technique is delivering improved protein models, and whether you plan to be using that method for an extended period of time, and only then do you put the time in to have more robust checkpoints. Rosetta Moderator: Mod.Sense |
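The behaviour described in this post (a checkpoint only at the end of each model, plus a safeguard that ends a task which keeps beginning from the same point) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not Rosetta's actual implementation: `TaskState`, `start_session`, `run_session` and `simulate` are hypothetical names, and the only figure taken from the thread is the limit of 5 starts from the same point.

```python
from dataclasses import dataclass
from typing import Tuple

# From the thread: a task that begins at the same starting point
# more than 5 times is detected and ended.
MAX_SAME_START = 5


@dataclass
class TaskState:
    """Hypothetical saved state; not Rosetta's real checkpoint format."""
    models_done: int = 0       # last checkpoint = number of completed models
    same_start_count: int = 0  # starts made from the current checkpoint


def start_session(state: TaskState) -> bool:
    """Record a (re)start; return False if the safeguard should end the task."""
    state.same_start_count += 1
    return state.same_start_count <= MAX_SAME_START


def run_session(state: TaskState, model_hours: float, session_hours: float) -> None:
    """Run whole models until BOINC exits; a partly finished model is lost."""
    completed = int(session_hours // model_hours)
    if completed > 0:
        state.models_done += completed
        state.same_start_count = 0  # new checkpoint, so a new starting point


def simulate(model_hours: float, session_hours: float, sessions: int) -> Tuple[int, bool]:
    """Return (models completed, whether the safeguard ended the task)."""
    state = TaskState()
    for _ in range(sessions):
        if not start_session(state):
            return state.models_done, True
        run_session(state, model_hours, session_hours)
    return state.models_done, False


# A 6-hour model on a PC that only runs BOINC 3 hours at a time never checkpoints.
print(simulate(model_hours=6, session_hours=3, sessions=10))
# A 2-hour model with the same sessions makes steady progress.
print(simulate(model_hours=2, session_hours=3, sessions=4))
```

In this toy model the outcome depends only on whether one model fits inside one uninterrupted session, which matches the explanation given for why some tasks resume from an intermediate point and others restart at 0%.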
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
As has been mentioned above, simply hibernate your computer instead of shutting it down every time. In addition to that, set BOINC to "run always" (activity menu) and "leave applications in memory while suspended" (preferences). That should solve this issue + you will be doing a lot more work. . |
speechless Send message Joined: 24 Nov 11 Posts: 6 Credit: 1,079,447 RAC: 0 |
I've noted something like that months ago on my main computer that I use constantly for other stuff besides BOINC. During the course of a day, Rosetta will reset all tasks to 0 several times. All Rosetta jobs will have the same elapsed time all of a sudden. Or at least BOINC will show so. Is there any way to check if the tasks are really reset or if it's just BOINC misreporting Rosetta's activity? Here's an excerpt from the log (I have no idea if this is useful, please tell me how to proceed): 1347886337 ue 14555.607570 ct 21786.480000 fe 40000000000000 nm Ebolanator3_2hxsa_ProteinInterfaceDesign_2Sep2012_58541_61_0 et 23199.856697 1347887877 ue 14555.607570 ct 10264.040000 fe 40000000000000 nm Rossmann2x3_abinitio_SAVE_ALL_OUT_design_f147_009_59222_792_0 et 10751.042927 1347888190 ue 14555.607570 ct 10031.080000 fe 40000000000000 nm Rossmann2x3_abinitio_SAVE_ALL_OUT_design_f148_005_59224_792_0 et 10473.551056 1347888249 ue 14555.607570 ct 10788.950000 fe 40000000000000 nm Ploop6_abinitio_design_relax_y079_001_59650_79_0 et 11327.748909 1347888371 ue 14555.607570 ct 10741.460000 fe 40000000000000 nm BAAB_repeat_9_16_abinitio_SAVE_ALL_OUT_59648_1677_0 et 11237.228733 1347889892 ue 14555.607570 ct 10793.230000 fe 40000000000000 nm Ebolanator3_2w3ja_ProteinInterfaceDesign_2Sep2012_58541_64_0 et 11231.671417 1347897371 ue 14555.607570 ct 8744.043000 fe 40000000000000 nm hyb_aj_12_bench_T0552_SAVE_ALL_OUT_IGNORE_THE_REST_59340_539_0 et 9121.400715 1347898215 ue 14555.607570 ct 9572.440000 fe 40000000000000 nm Ploop6_abinitio_design_y080_001_59653_76_0 et 10024.588375 1347898610 ue 14555.607570 ct 9764.680000 fe 40000000000000 nm Rossmann2x3_abinitio_SAVE_ALL_OUT_design_f149_006_59546_696_0 et 10237.707563 1347903968 ue 14555.607570 ct 23681.470000 fe 40000000000000 nm hyb_aj_12_bench_3rdeD_SAVE_ALL_OUT_IGNORE_THE_REST_59336_539_0 et 24783.952562 1347904731 ue 14555.607570 ct 16001.430000 fe 40000000000000 nm Ebolanator3_2i24n_ProteinInterfaceDesign_2Sep2012_58541_53_1 
et 16821.806150 1347906903 ue 14555.607570 ct 16101.630000 fe 40000000000000 nm Ebolanator3_1czpa_ProteinInterfaceDesign_2Sep2012_58540_64_0 et 16958.028938 1347908827 ue 14555.607570 ct 9600.411000 fe 40000000000000 nm hyb_aj_12_bench_2ltaA_SAVE_ALL_OUT_IGNORE_THE_REST_59305_546_0 et 10153.589746 1347909421 ue 14555.607570 ct 10452.020000 fe 40000000000000 nm TLUM_4_S7H53E81N60N64_1_rsmn_24023_FP2_17766_26326_03698_abinitio_59231_4279_0 et 11121.131086 1347992555 ue 14555.607570 ct 10785.750000 fe 40000000000000 nm Ebolanator3_3c6aa_ProteinInterfaceDesign_2Sep2012_58541_64_0 et 11415.956749 1347994365 ue 14555.607570 ct 10677.570000 fe 40000000000000 nm Rossmann2x3_abinitio_SAVE_ALL_OUT_design_f153_003_59554_1080_0 et 11324.382511 1347996383 ue 14555.607570 ct 9709.159000 fe 40000000000000 nm Ploop6_abinitio_design_y004_010_59651_88_0 et 10225.541660 1347996835 ue 14555.607570 ct 10256.640000 fe 40000000000000 nm Rossmann2x3_abinitio_SAVE_ALL_OUT_design_f148_006_59224_857_0 et 10807.879969 1347997785 ue 14555.607570 ct 9670.783000 fe 40000000000000 nm Ploop6_abinitio_design_y004_005_59651_76_0 et 10222.942511 1348000575 ue 14555.607570 ct 10326.690000 fe 40000000000000 nm BAABB_repeat_9_16_abinitio_SAVE_ALL_OUT_59647_2488_0 et 10843.980240 1348002870 ue 14555.607570 ct 11850.800000 fe 40000000000000 nm Ebolanator3_1i2ta_ProteinInterfaceDesign_2Sep2012_58541_68_0 et 12558.713116 1348003430 ue 14555.607570 ct 10776.490000 fe 40000000000000 nm Ebolanator3_2x30a_ProteinInterfaceDesign_2Sep2012_58541_64_0 et 11253.237648 1348330736 ue 14555.607570 ct 25658.670000 fe 40000000000000 nm hyb_aj_03_bench_3rdeD_SAVE_ALL_OUT_IGNORE_THE_REST_58731_72_1 et 26974.061079 1353151185 ue 14465.367843 ct 9715.337000 fe 40000000000000 nm Ploop5_y046_R2_abinitio_design_y002_005_64332_208_0 et 10942.212997 1353151625 ue 14465.367843 ct 9452.023000 fe 40000000000000 nm rb_11_15_34360_65675__round2_t000__0_D2_SAVE_ALL_OUT_IGNORE_THE_REST_64406_1413_0 et 10586.641685 1353157882 ue 
14465.367843 ct 6237.294000 fe 40000000000000 nm rb_11_16_34322_65856__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64562_661_0 et 6574.720789 1353166160 ue 14465.367843 ct 10034.890000 fe 40000000000000 nm rb_11_16_34899_65844_t000__2lsh_2012_IGNORE_THE_REST_10_04_64537_29_0 et 10446.164080 1353166951 ue 14465.367843 ct 11673.010000 fe 40000000000000 nm rb_11_16_34903_65857__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64563_2484_0 et 12262.155226 1353167199 ue 14465.367843 ct 10237.690000 fe 40000000000000 nm rb_11_16_34322_65856__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64562_1854_0 et 10699.731470 1353175885 ue 14465.367843 ct 9769.266000 fe 40000000000000 nm Ploop5_3_3_1_abinitio_design_y029_006_63494_1212_0 et 10197.354427 1353175983 ue 14465.367843 ct 9848.686000 fe 40000000000000 nm Ploop5_y046_R2_abinitio_design_y002_004_64332_522_0 et 10328.741339 1353177064 ue 14465.367843 ct 10693.320000 fe 40000000000000 nm lp2_5_le12_sym2_abinitio_SAVE_ALL_OUT_64492_29_0 et 11222.876724 1353177444 ue 14465.367843 ct 10630.810000 fe 40000000000000 nm Ploop5_y252_abinitio_design_relax_y002_001_64421_159_0 et 11211.536705 1353178145 ue 14465.367843 ct 10107.730000 fe 40000000000000 nm abt_3FAP_1_abinitio_SAVE_ALL_OUT_64388_136_0 et 10597.171879 1353178900 ue 14465.367843 ct 9683.949000 fe 40000000000000 nm abt_1V74_1_abinitio_SAVE_ALL_OUT_64397_112_0 et 10155.511887 1353182640 ue 14465.367843 ct 9904.207000 fe 40000000000000 nm lp4_4_le12_sym2_abinitio_SAVE_ALL_OUT_64502_33_0 et 10399.881248 1353232355 ue 14465.367843 ct 10785.790000 fe 40000000000000 nm H3i-A2E2_H3i_3cnra_ProteinInterfaceDesign_20121113_64263_16_0 et 11672.858671 1353234682 ue 14465.367843 ct 10626.730000 fe 40000000000000 nm proteinG_g056_008_64408_1079_0 et 11811.475600 1353235639 ue 14465.367843 ct 10611.610000 fe 40000000000000 nm proteinG_g057_005_64543_259_0 et 11816.788033 1353236581 ue 14465.367843 ct 10401.820000 fe 40000000000000 nm proteinG_g048_006_63896_1967_1 et 11602.575133 1353247373 ue 
14465.367843 ct 10783.150000 fe 40000000000000 nm H3i-A2E2_H3i_1fcya_ProteinInterfaceDesign_20121113_64263_19_0 et 11731.541395 1353247377 ue 14465.367843 ct 10700.610000 fe 40000000000000 nm proteinG_g057_004_64543_402_0 et 11693.947552 1353247435 ue 14465.367843 ct 10523.300000 fe 40000000000000 nm rb_11_16_34711_65239_h001__wnv158_IGNORE_THE_REST_05_08_64515_22_0 et 11450.796724 1353345347 ue 14465.367843 ct 10723.710000 fe 40000000000000 nm Ploop4_1_abinitio_design_y151_010_63979_1330_0 et 12008.656090 1353347613 ue 14465.367843 ct 10872.180000 fe 40000000000000 nm rb_11_16_34714_65350__t000__2_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64527_21_0 et 12202.014528 1353352143 ue 14465.367843 ct 10775.180000 fe 40000000000000 nm H3i-A2E2_H3i_3dfga_ProteinInterfaceDesign_20121113_64263_4_1 et 12135.017433 1353355434 ue 14465.367843 ct 5530.485000 fe 40000000000000 nm rb_11_19_34341_65967__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64605_544_0 et 6049.955606 1353359201 ue 14465.367843 ct 10400.840000 fe 40000000000000 nm rb_11_19_34338_65966__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64601_652_0 et 11288.596647 1353360987 ue 14465.367843 ct 10629.630000 fe 40000000000000 nm rb_11_19_34312_65953__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64603_405_0 et 11523.489429 1353432696 ue 14465.367843 ct 5513.544000 fe 40000000000000 nm rb_11_19_34351_65987__t000__2_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64625_1290_0 et 6211.555681 1353436124 ue 14465.367843 ct 7426.537000 fe 40000000000000 nm rb_11_18_34267_65927__round2_t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64630_801_0 et 8301.104070 1353438520 ue 14465.367843 ct 11323.930000 fe 40000000000000 nm rb_11_19_34322_65962__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64604_427_0 et 12383.227442 1353440933 ue 14465.367843 ct 7048.063000 fe 40000000000000 nm rb_11_18_34278_65931__round2_t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64624_1991_0 et 7824.579201 1353442612 ue 14465.367843 ct 8671.066000 fe 40000000000000 nm 
rb_11_19_34351_65987__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64625_562_0 et 9504.205460 1353447375 ue 14465.367843 ct 10455.090000 fe 40000000000000 nm rb_11_20_34717_65552_h003__chitin_IGNORE_THE_REST_10_05_64665_10_0 et 11180.708305 1353448097 ue 14465.367843 ct 992.213200 fe 40000000000000 nm 1D6R_zdock_1D6R_cluster_selectcst_c.2.22_SAVE_ALL_OUT_63520_5_1 et 1083.289762 1353485978 ue 14465.367843 ct 9451.897000 fe 40000000000000 nm rb_11_13_34346_65540_t000__t0737_IGNORE_THE_REST_11_04_64242_14_2 et 10276.260254 1353486342 ue 14465.367843 ct 9559.383000 fe 40000000000000 nm rb_11_13_34346_65540_t000__t0737_IGNORE_THE_REST_10_10_64242_15_2 et 10373.962673 1353493439 ue 14465.367843 ct 10501.610000 fe 40000000000000 nm rb_11_20_34717_65553_h004__chitin_IGNORE_THE_REST_07_03_64671_15_0 et 11700.280470 1353494096 ue 14465.367843 ct 9897.966000 fe 40000000000000 nm rb_11_20_34278_66066_h001__t0658_IGNORE_THE_REST_08_03_64684_32_0 et 11038.036537 1353495044 ue 14465.367843 ct 10442.350000 fe 40000000000000 nm rb_11_20_34717_65564_t000__chitin_IGNORE_THE_REST_12_05_64663_37_0 et 11694.464050 1353496485 ue 14465.367843 ct 11038.220000 fe 40000000000000 nm Ploop4_3_1_1_abinitio_design_y007_010_62603_1937_0 et 12286.699358 1353497921 ue 14465.367843 ct 10043.550000 fe 40000000000000 nm rb_11_20_34986_66089__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64888_269_0 et 11003.885393 1353504859 ue 14465.367843 ct 10361.040000 fe 40000000000000 nm rb_11_21_34987_66114_h003__kren2_IGNORE_THE_REST_12_09_65069_6_0 et 11270.589701 1353508320 ue 14465.367843 ct 8964.426000 fe 40000000000000 nm hyb_al_03_bench_3rcoB_SAVE_ALL_OUT_IGNORE_THE_REST_60713_389_0 et 9623.582802 1353528112 ue 14465.367843 ct 10438.980000 fe 40000000000000 nm rb_11_21_34987_66108_h002__kren2_IGNORE_THE_REST_09_06_64894_31_0 et 11406.213898 1353537998 ue 14465.367843 ct 10592.730000 fe 40000000000000 nm proteinG_g054_008_64273_1818_0 et 11309.517063 1353605764 ue 14465.367843 ct 10673.320000 fe 40000000000000 nm 
Ploop4_2_y465_abinitio_design_relax_y001_006_64694_134_0 et 11681.508596 1353609843 ue 15554.484590 ct 10097.850000 fe 40000000000000 nm rb_11_20_34718_65558_h004__eth_IGNORE_THE_REST_15_08_64678_7_0 et 11116.946295 1353613176 ue 15554.484590 ct 8228.600000 fe 40000000000000 nm hyb_al_07_bench_4d90B_SAVE_ALL_OUT_IGNORE_THE_REST_60908_486_0 et 9236.380236 1353619515 ue 15554.484590 ct 10768.360000 fe 40000000000000 nm H3i-A2E2_H3i_2vgda_ProteinInterfaceDesign_20121113_64263_34_0 et 11905.094433 1353660728 ue 15554.484590 ct 10757.110000 fe 40000000000000 nm H3i-A2E2_H3i_2qzja_ProteinInterfaceDesign_20121113_64263_45_0 et 12042.249367 1353669887 ue 15554.484590 ct 9819.655000 fe 40000000000000 nm rb_11_21_34949_66135__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_65111_340_0 et 11272.603031 |
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
I don't see the same elapsed time for every WU in this log. 1353618323 ue 14926.585573 ct 60986.310000 fe 40000000000000 nm rb_11_19_34322_65962__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64604_1700_0 et 63784.277998 The number after ct (marked green) is CPU time; the number after et (marked red) is run time / elapsed time. That's from this WU. What the 40000000000000 means I have no idea; it's in all my WUs as well. . |
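To check whether several tasks really share the same elapsed time, the log lines can be split into fields and compared. The sketch below assumes only what is visible in the excerpt: a leading Unix timestamp followed by alternating two-letter keys and values. The thread identifies `ct` as CPU time and `et` as elapsed time; the meaning of `ue` and `fe` is not confirmed here, so the code just stores them as numbers. `parse_job_log_line` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_job_log_line(line: str) -> dict:
    """Split one log line into a dict; 'nm' stays text, other values become floats."""
    tokens = line.split()
    record = {"timestamp": datetime.fromtimestamp(int(tokens[0]), tz=timezone.utc)}
    # After the timestamp, tokens alternate between a key and its value.
    for key, value in zip(tokens[1::2], tokens[2::2]):
        record[key] = value if key == "nm" else float(value)
    return record


# The line quoted in the post above.
line = (
    "1353618323 ue 14926.585573 ct 60986.310000 fe 40000000000000 "
    "nm rb_11_19_34322_65962__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_64604_1700_0 "
    "et 63784.277998"
)
rec = parse_job_log_line(line)

# Share of the elapsed time that the task actually spent on the CPU.
cpu_efficiency = rec["ct"] / rec["et"]
print(f"{rec['nm']}: CPU {rec['ct']:.0f} s, elapsed {rec['et']:.0f} s, "
      f"efficiency {cpu_efficiency:.1%}")
```

Running the same function over every line and collecting the `et` values would show directly whether the elapsed times in the log are identical or only look identical in the BOINC Manager display.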
speechless Send message Joined: 24 Nov 11 Posts: 6 Credit: 1,079,447 RAC: 0 |
I just got home after 6 hours away from my comp. The Rosetta tasks have not finished, although they take 5 hours. FYI, they start with hyb_al_03, ..04, ..08 and ..09. A task starting with ab_11_29__optpps seems to run okay. Error message in Event Log: 23/11/2012 23:43:49 | rosetta@home | Task hyb_aj_09_bench_try2_3vdxC_SAVE_ALL_OUT_IGNORE_THE_REST_63271_440_1 exited with zero status but no 'finished' file Steps taken to resolve problem: I've ticked the "leave in memory" box as suggested. I've allocated more CPU resources towards BOINC. I've left the PC running for 10 hours. --> Tasks do not complete. "Time elapsed" is always identical for Rosetta tasks. System information for debugging: I use a Core i7 950, Windows 8 Pro AND Windows 7 Ultimate (DualBoot) 64-Bit. Same problem on both OSs. After rebooting the tasks would often start at 0. Now they don't even finish. Other BOINC projects currently running: Docking (CPU only) and Poem (CPU and CPU/GPU-mix). There used to be FightMalaria for a while. My Mac and Windows 7 Professional 32-Bit don't seem to exhibit the problem. I will abort them and reset rosetta. If you need more info, do tell. I'll deliver it. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Are you rebooting to switch Windows versions when the tasks reset? If you have some Rosetta tasks running at the same time as other tasks, are you saying that all of the Rosetta tasks issue a message about not finding a 'finished' file at the same time? And are the other non-Rosetta tasks affected? Have you altered any other settings? How about swap space and memory usage? Rosetta Moderator: Mod.Sense |
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Error message in Event Log: This message might appear every now and then in the log, especially if the computer is busy with something else; it just should not happen too often. Steps taken to resolve problem: What are your settings for CPU usage now? Or to make it simple, post the entire global_prefs.xml if using web preferences or global_prefs_override.xml if using local preferences (both files are in the BOINC data directory). I've left the PC running for 10 hours. Well, if they started together and are still running that isn't very surprising... the only thing that's strange is that (according to your task list) you are using a 3-hour runtime preference, so the watchdog should kill them after 7 hours... unless they changed something here. System information for debugging: So I assume that we are talking about this Windows 7 computer and this Windows 8 computer. The results on both of these computers look OK so far, so it does not look like a general problem with Rosetta. After rebooting the tasks would often start at 0. Well, it's possible that those WUs have some very long-running models, so they won't checkpoint until they are finished. Are they "running" and using any CPU time? If not, restart BOINC; if they still don't use any CPU time, abort them and get some other ones. . |
speechless Send message Joined: 24 Nov 11 Posts: 6 Credit: 1,079,447 RAC: 0 |
Are you rebooting to switch Windows versions when the tasks reset? No, the tasks have reset themselves several times while the computer was running and I was away from it for 6+ hours. If you have some Rosetta tasks running at the same time as other tasks, are you saying that all of the Rosetta tasks issue a message about not finding a 'finished' file at the same time? Yes. And are the other non-Rosetta tasks affected? No. They run normally and show normal progress. Have you altered any other settings? How about swap space and memory usage? No. The swap space and memory usage were on default settings until yesterday, when I ticked the box "Leave applications in memory while suspended". |
speechless Send message Joined: 24 Nov 11 Posts: 6 Credit: 1,079,447 RAC: 0 |
What are your settings for CPU usage now? Or to make it simple, post the entire global_prefs.xml if using web preferences or global_prefs_override.xml if using local preferences (both files in BOINC data directory). global_pref_override.xml <global_preferences> <run_on_batteries>0</run_on_batteries> <run_if_user_active>1</run_if_user_active> <run_gpu_if_user_active>1</run_gpu_if_user_active> <suspend_cpu_usage>60.000000</suspend_cpu_usage> <start_hour>0.000000</start_hour> <end_hour>0.000000</end_hour> <net_start_hour>0.000000</net_start_hour> <net_end_hour>0.000000</net_end_hour> <leave_apps_in_memory>1</leave_apps_in_memory> <confirm_before_connecting>0</confirm_before_connecting> <hangup_if_dialed>0</hangup_if_dialed> <dont_verify_images>0</dont_verify_images> <work_buf_min_days>0.100000</work_buf_min_days> <work_buf_additional_days>0.500000</work_buf_additional_days> <max_ncpus_pct>0.000000</max_ncpus_pct> <cpu_scheduling_period_minutes>60.000000</cpu_scheduling_period_minutes> <disk_interval>60.000000</disk_interval> <disk_max_used_gb>10.000000</disk_max_used_gb> <disk_max_used_pct>50.000000</disk_max_used_pct> <disk_min_free_gb>0.100000</disk_min_free_gb> <vm_max_used_pct>75.000000</vm_max_used_pct> <ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct> <ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct> <max_bytes_sec_up>0.000000</max_bytes_sec_up> <max_bytes_sec_down>0.000000</max_bytes_sec_down> <cpu_usage_limit>100.000000</cpu_usage_limit> <daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb> <daily_xfer_period_days>0</daily_xfer_period_days> </global_preferences> Well, if they started together and are still running that isn't very surprising... I beg to differ. It is surprising because usually tasks do not start at exactly the same time because all tasks do not take equally long to finish. They should always differ by a few seconds at least when the next task starts. At least this is the case for non-Rosetta projects. 
(The only exception to this is when BOINC is freshly installed, when I start a new project, or if I abort all running tasks and force new tasks to download.) The only thing that's strange is that (according to your task list) you are using a 3-hour runtime preference, so the watchdog should kill them after 7 hours... unless they changed something here. I am most definitely using 60-minute runtimes (if runtimes are the same as "Switch between applications") http://img213.imageshack.us/img213/9146/20121124131908.png So I assume that we are talking about this Windows 7 computer and this Windows 8 computer. The results on both of these computers look OK so far, so it does not look like a general problem with Rosetta. Yikes, there goes my anonymity. Yes, this is the computer that I am dual-booting. Well, it's possible that those WUs have some very long-running models, so they won't checkpoint until they are finished. Are they "running" and using any CPU time? If not, restart BOINC; if they still don't use any CPU time, abort them and get some other ones. They are running and using CPU time as shown here: http://img203.imageshack.us/img203/3203/20121124132739.png I will be running only Rosetta for the next few days to see if I can find out more. |
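A preferences file like the one quoted in this post can be read with a few lines of code instead of by eye. The sketch below is illustrative only: `PREFS_XML` is a shortened excerpt whose element names and values are copied from the post, `read_prefs` is a hypothetical helper, and the mapping of `cpu_scheduling_period_minutes` to the "Switch between applications" setting is an assumption based on the 60-minute value mentioned in the post.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the override file quoted in the post (values copied from it).
PREFS_XML = """
<global_preferences>
  <leave_apps_in_memory>1</leave_apps_in_memory>
  <cpu_scheduling_period_minutes>60.000000</cpu_scheduling_period_minutes>
  <cpu_usage_limit>100.000000</cpu_usage_limit>
</global_preferences>
"""


def read_prefs(xml_text: str) -> dict:
    """Return every child element of <global_preferences> as a float."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: float(child.text) for child in root}


prefs = read_prefs(PREFS_XML)
print("Leave apps in memory:", bool(prefs["leave_apps_in_memory"]))
# Assumption: this element is the "Switch between applications" interval.
print("Switch between applications every",
      prefs["cpu_scheduling_period_minutes"], "minutes")
print("CPU usage limit:", prefs["cpu_usage_limit"], "%")
```

Note that none of these elements is a per-task run time, which fits the point made elsewhere in the thread that the Rosetta run-time target is a separate, project-level preference.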
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Hmm... I don't see anything there that could cause the problems. For now I'd simply suggest observing the tasks to make sure they are doing something, i.e. using CPU time. They have to finish at some point or they will be killed by the watchdog. If they resume again, abort them and get new ones. The only thing that's strange is that (according to your task list) you are using a 3-hour runtime preference, so the watchdog should kill them after 7 hours... unless they changed something here. I mean the "target CPU run time" in your Rosetta@Home Preferences. Note that it's target CPU time, not elapsed time, so if something else is using much CPU time, the elapsed time might be a lot longer. . |
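The difference between target CPU time and elapsed time can be put into numbers. The sketch below rests on two assumptions: the 4-hour watchdog margin is only inferred from the "3 hours ... 7 hours" figures quoted in the thread, and CPU efficiency is taken to be CPU time divided by elapsed time. Both function names are hypothetical.

```python
# Assumption: inferred from "3 hours runtime preference ... killed after 7 hours".
WATCHDOG_MARGIN_HOURS = 4.0


def watchdog_cpu_limit_hours(target_cpu_hours: float) -> float:
    """CPU time after which the watchdog would be expected to end the task."""
    return target_cpu_hours + WATCHDOG_MARGIN_HOURS


def elapsed_hours_needed(cpu_hours: float, cpu_efficiency: float) -> float:
    """Wall-clock hours needed to accumulate cpu_hours at a given efficiency."""
    if not 0 < cpu_efficiency <= 1:
        raise ValueError("cpu_efficiency must be in the range (0, 1]")
    return cpu_hours / cpu_efficiency


limit = watchdog_cpu_limit_hours(3.0)
# If other programs leave the task only half of the CPU time, the same
# CPU-time limit takes twice as long to reach on the wall clock.
print(limit, elapsed_hours_needed(limit, 0.5))
```

Under these assumptions, a task that is still running after 10 wall-clock hours has not necessarily passed the watchdog limit if it received only part of the CPU during that time.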
speechless Send message Joined: 24 Nov 11 Posts: 6 Credit: 1,079,447 RAC: 0 |
I think I am on to something: Look what just happened in Rosetta: 24/11/2012 17:36:19 | | System clock was turned backwards; clearing timeouts 24/11/2012 17:36:19 | rosetta@home | Task Ccyst5_d3_0006_abinitio_SAVE_ALL_OUT_64221_580_1 exited with zero status but no 'finished' file 24/11/2012 17:36:19 | rosetta@home | If this happens repeatedly you may need to reset the project. 24/11/2012 17:36:19 | rosetta@home | Task Ccyst5_d19_0002_abinitio_SAVE_ALL_OUT_64225_592_1 exited with zero status but no 'finished' file 24/11/2012 17:36:19 | rosetta@home | If this happens repeatedly you may need to reset the project. 24/11/2012 17:36:19 | rosetta@home | Task hyb_aj_09_bench_try2_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_63256_1923_2 exited with zero status but no 'finished' file 24/11/2012 17:36:19 | rosetta@home | If this happens repeatedly you may need to reset the project. 24/11/2012 17:36:19 | rosetta@home | Task hyb_aj_03_bench_try2_3rj8A_SAVE_ALL_OUT_IGNORE_THE_REST_62989_1265_1 exited with zero status but no 'finished' file 24/11/2012 17:36:19 | rosetta@home | If this happens repeatedly you may need to reset the project. 24/11/2012 17:36:19 | rosetta@home | Task Ploop4_3_1_1_1_abinitio_design_y017_005_63430_1425_0 exited with zero status but no 'finished' file 24/11/2012 17:36:19 | rosetta@home | If this happens repeatedly you may need to reset the project. 24/11/2012 17:36:19 | rosetta@home | Task rb_11_24_35038_66192_h004__t0658_IGNORE_THE_REST_09_03_65499_27_0 exited with zero status but no 'finished' file 24/11/2012 17:36:19 | rosetta@home | If this happens repeatedly you may need to reset the project. 24/11/2012 17:36:19 | rosetta@home | Task hyb_al_bench_3STO_SAVE_ALL_OUT_IGNORE_THE_REST_64740_172_0 exited with zero status but no 'finished' file 24/11/2012 17:36:19 | rosetta@home | If this happens repeatedly you may need to reset the project. 
24/11/2012 17:36:19 | rosetta@home | Task Ploop4_1_abinitio_design_y151_008_63979_988_1 exited with zero status but no 'finished' file 24/11/2012 17:36:19 | rosetta@home | If this happens repeatedly you may need to reset the project. 24/11/2012 17:36:19 | rosetta@home | Restarting task Ccyst5_d3_0006_abinitio_SAVE_ALL_OUT_64221_580_1 using minirosetta version 345 in slot 15 24/11/2012 17:36:19 | rosetta@home | Restarting task Ccyst5_d19_0002_abinitio_SAVE_ALL_OUT_64225_592_1 using minirosetta version 345 in slot 17 24/11/2012 17:36:19 | rosetta@home | Restarting task hyb_aj_09_bench_try2_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_63256_1923_2 using minirosetta version 345 in slot 19 24/11/2012 17:36:19 | rosetta@home | Restarting task hyb_aj_03_bench_try2_3rj8A_SAVE_ALL_OUT_IGNORE_THE_REST_62989_1265_1 using minirosetta version 345 in slot 21 24/11/2012 17:36:19 | rosetta@home | Restarting task Ploop4_3_1_1_1_abinitio_design_y017_005_63430_1425_0 using minirosetta version 345 in slot 16 24/11/2012 17:36:19 | rosetta@home | Restarting task rb_11_24_35038_66192_h004__t0658_IGNORE_THE_REST_09_03_65499_27_0 using minirosetta version 345 in slot 20 24/11/2012 17:36:19 | rosetta@home | Restarting task hyb_al_bench_3STO_SAVE_ALL_OUT_IGNORE_THE_REST_64740_172_0 using minirosetta version 345 in slot 22 24/11/2012 17:36:19 | rosetta@home | Restarting task Ploop4_1_abinitio_design_y151_008_63979_988_1 using minirosetta version 345 in slot 18 I just witnessed the resetting to 0% of 2 tasks, both hyb_aj_0 etc.; the progress of both was around 38% and is now back to 0. hyb_al is also back to 0. The other tasks are fine. Yesterday I had issues with exactly the same type of WU (hyb_aj and hyb_al). Now it looks like this: See how time elapsed is exactly the same for the 3 tasks? And what does "System clock was turned backwards; clearing timeouts" mean? 
I got 2 more this afternoon: 24/11/2012 13:38:53 | | System clock was turned backwards; clearing timeouts 24/11/2012 15:37:25 | | System clock was turned backwards; clearing timeouts Every 59 minutes or so the system clock turns backwards. Any idea what this means? |
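The spacing of the "System clock was turned backwards" messages can be measured from the Event Log text itself. The sketch below uses only the three timestamps quoted in this post and assumes the `DD/MM/YYYY HH:MM:SS | project | message` layout seen in the excerpts; `clock_reset_times` and `gaps_in_minutes` are hypothetical helper names.

```python
from datetime import datetime

# The three clock messages quoted in the post.
EVENT_LOG = """\
24/11/2012 13:38:53 | | System clock was turned backwards; clearing timeouts
24/11/2012 15:37:25 | | System clock was turned backwards; clearing timeouts
24/11/2012 17:36:19 | | System clock was turned backwards; clearing timeouts
"""

MARKER = "System clock was turned backwards"


def clock_reset_times(log_text: str) -> list:
    """Collect the timestamps of all lines containing the clock message."""
    times = []
    for line in log_text.splitlines():
        if MARKER in line:
            stamp = line.split("|", 1)[0].strip()
            times.append(datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"))
    return sorted(times)


def gaps_in_minutes(times: list) -> list:
    """Minutes between each pair of consecutive timestamps."""
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


gaps = gaps_in_minutes(clock_reset_times(EVENT_LOG))
print([round(g, 1) for g in gaps])
```

For these three entries the gaps come out at roughly 119 minutes each, about twice the 59 minutes mentioned in the post. The excerpt alone does not show whether further clock messages occurred in between, so running this over the full log would be the way to confirm the real interval.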
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
See how time elapsed is exactly the same for the 3 tasks? Yes. After the crash the tasks start from the last checkpoint, which in this case means from scratch (the WUs apparently hadn't checkpointed so far). So they all start at the same time with 0 elapsed time, and that automatically leads to exactly the same elapsed time for all of them. I just witnessed the resetting to 0% of 2 tasks, both hyb_aj_0 etc.; the progress of both was around 38% and is now back to 0. The other tasks had to start from their last checkpoints; that's for sure better than from the beginning, but there too you lost some already completed work. The hyb_aj and hyb_al WUs seem to run some long-running models, so they don't checkpoint so frequently, but that's normally not an issue. Your issue is this: 24/11/2012 13:38:53 | | System clock was turned backwards; clearing timeouts It means that Windows is synchronizing with internet time and your clock is running a bit too fast. But that should not happen every hour, more like once a week IIRC. For now you could disable that and see if it stops. . |
speechless Send message Joined: 24 Nov 11 Posts: 6 Credit: 1,079,447 RAC: 0 |
I just realized I've hijacked this thread. Apologies. Thanks for your insight concerning the time server. I have no idea how to reduce the sync interval that causes this problem. I haven't found any apparent Windows setting. The time sync setting does not allow me to choose an interval. On a related note, the hyb_a* WUs do seem to cause more grief than others. This thread here is also about issues with this kind of WU: https://boinc.bakerlab.org/forum_thread.php?id=6110&nowrap=true#74548 As for now, I will pull my dual-boot machine from Rosetta. I don't want to babysit tasks. |
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Thanks for your insight concerning the time server. I'd simply disable it for now and see if it helps. On a related note, the hyb_a* WUs do seem to cause more grief than others. Well, yes, it looks like some more people/computers have issues with them. I have one running right now and it runs fine so far, but didn't checkpoint since 1h15m of CPU time. Not an issue for me, but might be for others. As for now, I will pull my dual-boot machine from Rosetta. Well, Rosetta is really not the best project for systems which are rebooted often since it does not checkpoint sometimes even for several hours (over 12 hours CPU time without a checkpoint was the highest I've seen so far on my machine). . |
Tsprit Send message Joined: 12 Jul 20 Posts: 8 Credit: 5,414 RAC: 0 |
Yes... I JUST had this happen with a task. I'm letting it do 2 tasks since my computer can handle it, and both of them were about 3 hours away from being done, but one of them was reset. My internet just malfunctioned; there have been some disconnection issues the past 2 days, but I shouldn't have lost my progress. Maybe they had some issues on their end too, but I'm back up and running. If that's how data is lost, then how would disconnection issues on both ends result in that, if it does at all? In all it was almost two and a half days of data. I've just set it so it doesn't store any additional days of work except 2 now; I thought I had set it not to before, but it got changed back. Thankfully I can get this 1 task of 2 days of work in before the deadline tomorrow. The other one reset to 1 day of work, though, so maybe it lost only 1 day of work and reset to the last checkpoint a day ago. What a damn shame for them. Does this happen frequently for them? Edit: Oh no, never mind, they got the data. I'm looking at my tasks on my account and it looks like it got canceled by the server, then got completed and validated. So I think they sometimes pull the plug on tasks themselves in order to collect the data that's done; maybe they decide they have enough data and then just cancel. Or perhaps they too had disconnection issues, but the server has a fail-safe where it collects the data you've done in order to save it. |
©2024 University of Washington
https://www.bakerlab.org