Message boards : Number crunching : invalid results; 24 hours wasted
Author | Message |
---|---|
ChristianVirtual Send message Joined: 29 Apr 17 Posts: 5 Credit: 1,684,275 RAC: 0 |
It's really frustrating to spend 24 hours of CPU cycles only to have a WU invalidated: https://boinc.bakerlab.org/workunit.php?wuid=897665706 https://boinc.bakerlab.org/workunit.php?wuid=897666513 The machine has enough RAM and storage, so that should not be the limit. Ryzen 1700X, Ubuntu 17.10. <core_client_version>7.11.0</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_05_08_164_241__t000__1_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_05_08_164_241__t000__1_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3291164 Starting watchdog... Watchdog active. ====================================================== DONE :: 329 starting structures 86145 cpu seconds This process generated 329 decoys from 329 attempts ====================================================== BOINC :: WS_max 4.30068e+08 BOINC :: Watchdog shutting down... 12:01:32 (3632): called boinc_finish(0) </stderr_txt> ]]> What can one do? |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
You are suffering the same fate with your Ryzen 1700X that I encountered and reported earlier with my Ryzen 1700 (Lubuntu 17.10). That is, low (and inconsistent) output, as indicated by the credits, along with a higher than normal error rate. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87833#87833 https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874 And I am hardly alone. Others have reported similar problems with Ryzen. In fact, it seems to be an AMD problem in general insofar as I can tell, affecting most of their CPUs. So it may be that the Rosetta app is not compiled with an AMD-optimized compiler, for example. But Ryzen works great on WCG and all of the other projects I have tried it on, which is a lot of them, including LHC, which uses VirtualBox. So I use it for WCG, which is what I built it for originally anyway. If you want a good Rosetta machine, use Intel. And the later the Intel chip, the better. Having tried it on Ivy Bridge and Haswell, I have now found that Coffee Lake (i7-8700) gives the most consistent output (Ubuntu 18.04), though I am still in the early testing phase. https://boinc.bakerlab.org/rosetta/results.php?hostid=3399951&offset=0&show_names=0&state=4&appid= My results with Ivy Bridge and Haswell may be of some interest, though they were somewhat inconclusive. But the i7-8700 makes them irrelevant now for me. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12544 Use your excellent Ryzen 1700X elsewhere. Maybe Rosetta will see the light and fix their stuff someday, though I don't know that they have even looked into the problem, or even consider it a problem yet. |
ChristianVirtual Send message Joined: 29 Apr 17 Posts: 5 Credit: 1,684,275 RAC: 0 |
I think you are right; I ran an i7-8700 and a 3930 over the past few days and they had fewer problems. I also agree that other projects like WCG have far fewer issues with Ryzen. Too bad for Rosetta ... |
mmonnin Send message Joined: 2 Jun 16 Posts: 59 Credit: 24,222,307 RAC: 83,030 |
Quite a few teams are finishing the Pentathlon, a 3-day team event in which Rosetta is the project. Errors with the Rosetta app are why I select 1-hour tasks here. If a task is running for 6 hours, then it'll probably error out anyway. I can always add more clients for more tasks if needed. |
John P. Myers Send message Joined: 13 Apr 10 Posts: 5 Credit: 860,221 RAC: 0 |
The issue may not be with Ryzen itself but with the number of threads it has. It seems anything with 16 or more threads was getting crazy high error rates, including Opterons and Xeons. I took my Xeon rig off of Rosetta for this exact reason about 2 hours after the project was announced due to the errors. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 8,196 |
The issue may not be with Ryzen itself but with the number of threads it has. It seems anything with 16 or more threads was getting crazy high error rates, including Opterons and Xeons. I took my Xeon rig off of Rosetta for this exact reason about 2 hours after the project was announced due to the errors. I was running 12 on my Linux machine and "perf top" showed that the 4.07 application's hottest code was looping in a "LOCKED" spin loop. I disassembled the binary so I could poke around. I isolated the top 5 or so hottest sections of code and they were all locked spin loops OR accessing memory following a function return. I did not see much floating point computation. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
I was running 12 on my Linux machine and "perf top" showed that the 4.07 application's hottest code was looping in a "LOCKED" spin loop. I disassembled the binary so I could poke around. I isolated the top 5 or so hottest sections of code and they were all locked spin loops OR accessing memory following a function return. I did not see much floating point computation. From the last BOINC PCM, David wrote: "Rosetta@home: 1 developer/programmer/tester, 2 systems administrators/engineers (David Kim, Luki Goldschmidt, Patrick Vecchiato)." Not a lot of people to optimize/debug the code :-( |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Sounds similar to what I'm seeing. Unfortunately at this point I don't care that much, but maybe the laid-back attitude is okay. Anyway, here are the properties of one of the sick tasks: Application: Rosetta Mini 3.78; Name: nRoCM_new_01_P04805_group0_7_congq_SAVE_ALL_OUT_IGNORE_THE_REST_609269_3; State: Running; Received: Sat 19 May 2018 05:12:15 AM JST; Report deadline: Sun 27 May 2018 05:12:14 AM JST; Estimated computation size: 80,000 GFLOPs; CPU time: 00:44:04; CPU time since checkpoint: 00:44:04; Elapsed time: 13:13:50; Estimated time remaining: ---; Fraction done: 6.107%; Virtual memory size: 451.04 MB; Working set size: 308.38 MB; Directory: slots/0; Process ID: 7829; Progress rate: 0.360% per hour; Executable: minirosetta_3.78_x86_64-pc-linux-gnu #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
ChristianVirtual Send message Joined: 29 Apr 17 Posts: 5 Credit: 1,684,275 RAC: 0 |
name: rb_07_16_508_732__t000__0_C3_SAVE_ALL_OUT_IGNORE_THE_REST_682151_12503; application: Rosetta; created: 17 Jul 2018, 13:17:32 UTC; canonical result: 1016016653; granted credit: 238.10; minimum quorum: 1; initial replication: 1; max # of error/total/success tasks: 1, 1, 1; errors: Too many total results. Got another one, this time with "Too many total results" ... What does that mean? Is the server handing out more tasks and dumping those of us who are still contributing CPU cycles? |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I am having a stroke of good luck on both Ubuntu 16.04 (i7-3770) and Win7 64-bit (i7-4771), so I don't think a high error rate is the project's fault, at least on Intel chips. I pick up about one error a day on my Ryzen 1700, but never on a long runner thus far (though I don't use it much). https://boinc.bakerlab.org/rosetta/results.php?hostid=3421421 https://boinc.bakerlab.org/rosetta/results.php?hostid=3118747 But if you want to use Ubuntu 18.04, you have to apply the fix that rjs5/juha proposed. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88954#88954 |
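
For anyone who wants to put numbers on the two cases quoted in this thread, here is a rough Python sketch. It uses only figures that appear in the posts above: the `-cpu_run_time 28800` flag and the `86145 cpu seconds` / `329 decoys` summary from the stderr output, plus the CPU time and elapsed time from the Rosetta Mini 3.78 task properties. The function names (`parse_hms`, `summarize_stderr`, `cpu_efficiency`) are made up for illustration and are not part of BOINC or Rosetta, and the regular expressions simply assume the stderr wording shown in this thread.

```python
import re


def parse_hms(text):
    """Convert an HH:MM:SS string (as shown in task properties) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def summarize_stderr(stderr_text):
    """Extract target runtime, actual CPU seconds and decoy count from stderr.

    Assumes the wording quoted in this thread; other app versions may differ.
    """
    target = int(re.search(r"-cpu_run_time (\d+)", stderr_text).group(1))
    cpu = int(re.search(r"(\d+) cpu seconds", stderr_text).group(1))
    decoys = int(re.search(r"generated (\d+) decoys", stderr_text).group(1))
    return {
        "target_s": target,
        "cpu_s": cpu,
        "decoys": decoys,
        "overrun_factor": cpu / target,   # actual CPU time vs. requested
        "s_per_decoy": cpu / decoys,      # average CPU seconds per decoy
    }


def cpu_efficiency(cpu_time, elapsed_time):
    """Fraction of wall-clock time the task actually spent on the CPU."""
    return parse_hms(cpu_time) / parse_hms(elapsed_time)


# Shortened copy of the stderr text from the first post.
STDERR_SNIPPET = (
    "-nstruct 10000 -cpu_run_time 28800 -watchdog "
    "DONE :: 329 starting structures 86145 cpu seconds "
    "This process generated 329 decoys from 329 attempts"
)

if __name__ == "__main__":
    summary = summarize_stderr(STDERR_SNIPPET)
    print(f"target runtime : {summary['target_s'] / 3600:.1f} h")
    print(f"actual CPU time: {summary['cpu_s'] / 3600:.1f} h")
    print(f"overrun factor : {summary['overrun_factor']:.2f}x")
    print(f"per decoy      : {summary['s_per_decoy']:.1f} s")
    # Figures from the Rosetta Mini 3.78 task properties posted above.
    print(f"CPU efficiency : {cpu_efficiency('00:44:04', '13:13:50'):.1%}")
```

With the numbers from this thread, the first task used roughly three times its requested 8-hour CPU target, and the Rosetta Mini task spent only about 5.6% of its elapsed time on the CPU. The sketch only quantifies the symptoms; it says nothing about the cause.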
©2024 University of Washington
https://www.bakerlab.org