Minirosetta 3.73-3.78

Message boards : Number crunching : Minirosetta 3.73-3.78

Tero

Joined: 22 Jul 17
Posts: 1
Credit: 939,545
RAC: 0
Message 87554 - Posted: 22 Oct 2017, 13:03:22 UTC

It seems that version 3.78 broke compatibility with the Linux client. After the Minirosetta 3.78 update, tasks started to fail with "computation error". The latest version of the "regular" Rosetta works fine. I run CentOS Linux 7.3 with client 7.6.22. The error seems to be in how the new version handles files:

ERROR: in::file::zip minirosetta_database.zip does not exist!
ERROR:: Exit from: src/apps/public/boinc/minirosetta.cc line: 195

(Example workunit 853521226)

There is a database zip file, but its name is minirosetta_database_d0bf94b.zip. If I copy the zip file to minirosetta_database.zip, I get file errors like "ERROR: ERROR: Option file open failed for: 'flags_rb_10_11_78082_120670__t000__0_C1_robetta'" (workunit 854223185). That file was present in the project folder.
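Editor's sketch: the mismatch described above (the app asks for minirosetta_database.zip while the folder holds a hash-suffixed copy) can be checked with a few lines of Python. The function name and the glob pattern are assumptions made for illustration; nothing here is part of Rosetta or BOINC.

```python
from pathlib import Path


def find_database_zips(project_dir, expected="minirosetta_database.zip"):
    """Report whether the expected zip exists and list similarly named zips.

    `expected` is the name from the error message; the glob pattern is an
    assumption based on the hash-suffixed name reported in the post.
    """
    project_dir = Path(project_dir)
    candidates = sorted(p.name for p in project_dir.glob("minirosetta_database*.zip"))
    return (project_dir / expected).is_file(), candidates


if __name__ == "__main__":
    # Hypothetical path: point this at your own BOINC project folder.
    present, found = find_database_zips("/var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta")
    print("expected file present:", present)
    print("similarly named zips:", found)
```

If the first value is False while the list is not empty, the situation matches the report: a database zip is there, only under a different name.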
ID: 87554 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile planetclown

Joined: 27 Jan 12
Posts: 5
Credit: 13,101,172
RAC: 8,983
Message 87776 - Posted: 30 Nov 2017, 11:59:01 UTC
Last modified: 30 Nov 2017, 12:14:12 UTC

Hello, I'm occasionally seeing two different errors on the following apps:

    Rosetta Mini v3.78 x86_64-pc-linux-gnu
    Rosetta Mini v3.78 i686-pc-linux-gnu


I've seen it on Lubuntu and Linux Mint (both Ubuntu 16.04/Xenial) along with BOINC 7.6.31. Link to computer.

The first error is a glibc abort reporting "free(): invalid pointer":

BOINC:: Worker startup. 
Starting watchdog...
Watchdog active.
*** glibc detected *** ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu: free(): invalid pointer: 0x13867fb8 ***
======= Backtrace: =========
[0xdf36941]
[0xdf3a45b]
[0xede768c]
[0xdeffb51]
[0x81630ad]
[0xd45eb92]
[0xd45ebcb]
[0xd465336]
[0xd46ca67]
[0xd46feef]
[0xd474232]
[0xd400a01]
[0xd40c69a]
[0xc9ac83d]
[0xc9ad47f]
[0xca8b53f]
[0xb08de97]
[0xb265920]
[0xb2a83b6]
[0xb29f4d2]
[0x8aaae73]
[0x8aae71d]
[0x8ab361b]
[0x8a925f9]
[0x8a65a47]
[0xb371855]
[0xb3743be]
[0xb434b13]
[0xb43119d]
[0x8a6fa23]
[0x8056303]
[0xdf0cfd8]
[0x8048131]
======= Memory map: ========
08048000-0ede4000 r-xp 00000000 08:05 1183736                            /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu
0ede4000-0edec000 rw-p 06d9c000 08:05 1183736                            /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu
0edec000-0f115000 rw-p 00000000 00:00 0 
10d45000-17e18000 rw-p 00000000 00:00 0                                  [heap]
ebd2d000-f2cd4000 rw-p 00000000 00:00 0 
f305c000-f3d64000 rw-p 00000000 00:00 0 
f4200000-f4221000 rw-p 00000000 00:00 0 
f4221000-f4300000 ---p 00000000 00:00 0 
f517e000-f517f000 ---p 00000000 00:00 0 
f517f000-f5e8f000 rw-p 00000000 00:00 0 
f5e8f000-f7667000 rw-s 00000000 08:05 1581177                            /var/lib/boinc-client/slots/11/boinc_minirosetta_11
f7667000-f7668000 ---p 00000000 00:00 0 
f7668000-f766b000 rw-p 00000000 00:00 0 
f766b000-f766d000 rw-s 00000000 08:05 1581173                            /var/lib/boinc-client/slots/11/boinc_mmap_file
f766d000-f776a000 rw-p 00000000 00:00 0 
f776a000-f776c000 r--p 00000000 00:00 0                                  [vvar]
f776c000-f776e000 r-xp 00000000 00:00 0                                  [vdso]
ffc6c000-ffc8e000 rw-p 00000000 00:00 0                                  [stack]

</stderr_txt>
]]>


The second error is SIGSEGV: segmentation violation
BOINC:: Worker startup. 
Starting watchdog...
Watchdog active.
SIGSEGV: segmentation violation
Stack trace (4 frames):
[0xde75dcf]
[0xf77ceca0]
[0xdf36358]
[0xeffb51ff]

Exiting...

</stderr_txt>
]]>


I haven't seen any errors while running Rosetta v4.06 app or other BOINC projects. Any help would be appreciated. Thank you!
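Editor's sketch: the glibc report above prints raw addresses without symbols, but the memory map that follows it says which region each address belongs to. The Python below parses map lines in the format shown and looks up a few of the quoted addresses. The helper names are made up for illustration, and only four of the map lines are copied into the sample.

```python
import re

# One line of a Linux memory map: start-end perms offset dev inode [name]
MAP_LINE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\d+\s*(.*)$")


def parse_maps(text):
    """Return a list of (start, end, perms, name) tuples."""
    regions = []
    for line in text.splitlines():
        m = MAP_LINE.match(line.strip())
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            name = m.group(4).strip() or "[anonymous]"
            regions.append((start, end, m.group(3), name))
    return regions


def locate(addr, regions):
    """Return (perms, name) of the region containing addr, or None."""
    for start, end, perms, name in regions:
        if start <= addr < end:
            return perms, name
    return None


# Four lines copied from the memory map in the post above.
SAMPLE = """\
08048000-0ede4000 r-xp 00000000 08:05 1183736  /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu
0ede4000-0edec000 rw-p 06d9c000 08:05 1183736  /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu
10d45000-17e18000 rw-p 00000000 00:00 0        [heap]
ffc6c000-ffc8e000 rw-p 00000000 00:00 0        [stack]
"""

REGIONS = parse_maps(SAMPLE)

if __name__ == "__main__":
    # First backtrace frame, third backtrace frame, and the pointer passed to free().
    for addr in (0xDF36941, 0xEDE768C, 0x13867FB8):
        print(hex(addr), locate(addr, REGIONS))
```

With this sample, the first frame lands in the executable (r-xp) mapping of the minirosetta binary and the pointer glibc complained about lands in the heap. That only locates the addresses; it does not say whether the bug is in the application or the hardware.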
ID: 87776 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 87802 - Posted: 3 Dec 2017, 16:09:12 UTC - in response to Message 87800.  

Sorry to say that but your crappy Ryzen is the problem


Ryzen is crappy? Are you a troll?
ID: 87802 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 87805 - Posted: 4 Dec 2017, 11:16:18 UTC - in response to Message 87803.  

Either the application crashes outright with a segmentation fault, or the C library kills it because it detected an invalid pointer, this way preventing a possible segfault. If you think about it there must also be cases where an invalid pointer goes unnoticed but doesn't cause a segfault.

If you have an invalid pointer in your software, it's your problem, not a CPU problem.

Search for "kill_ryzen" or "marginality error" and you'll find many reports on Ryzens segfaulting in a particular use case: massive parallel compiler runs on Linux. An extreme scenario, but not unrealistic, and there's no excuse for simply crashing.

The problem was solved months ago, with free replacements of early Ryzen chips and a BIOS update (AGESA 1.0.0.6b).
ID: 87805 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 87806 - Posted: 4 Dec 2017, 12:11:26 UTC - in response to Message 87805.  

The problem was solved months ago, with free replacements of early Ryzen chips and a BIOS update (AGESA 1.0.0.6b).

I purchased a Ryzen 1700 made in week 33 of 2017, so it is a fixed version. It is on an ASRock Fatal1ty X370 Gaming X motherboard with the AGESA 1.0.0.6b BIOS, and with 32 GB of Patriot DDR4 memory (15-15-15-36).

The CPU is not overclocked, and runs Ubuntu 17.10. I just started running Rosetta on 15 cores, with the other core supporting a GTX 970 on Folding. Previously, it had been running WCG for about a month with no errors, but that is too easy.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3299745

In addition to errors, I am interested in the output. These are the 24-hour work units, and I was averaging about 800 points each on an i7-3770 (7 cores, with one reserved for a GPU, also on Ubuntu) for those that ran the full 24 hours.

We will see how it goes.
ID: 87806 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
mmonnin

Joined: 2 Jun 16
Posts: 61
Credit: 25,390,629
RAC: 47,239
Message 87808 - Posted: 4 Dec 2017, 14:05:01 UTC

You can RMA segfault Zen chips.

http://www.extremetech.com/computing/254750-amd-replaces-ryzen-cpus-users-affected-rare-linux-bug
ID: 87808 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 87814 - Posted: 4 Dec 2017, 17:26:01 UTC - in response to Message 87813.  
Last modified: 4 Dec 2017, 17:34:03 UTC

That's not a solution, it's an emergency measure. And of course I expect it to be free. Good thing this option exists though. But in this RMA process they'll ask you to run tests and document them with photos. Believe it or not, I have no means to take photos, so no RMA for me.


There is a radical solution: switch to Windows 10. The problem goes away :-P
Or wait for 4.06 to become the default application.
ID: 87814 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 87815 - Posted: 4 Dec 2017, 17:40:30 UTC - in response to Message 87813.  

Problem solved months ago
I'm not aware of an official statement saying the problem's been identified, let alone solved. Care to give me a pointer? *giggles*

The new "RMA Ryzen" chips do not have this problem, so they found it and resolved it...
ID: 87815 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 87833 - Posted: 6 Dec 2017, 3:26:21 UTC - in response to Message 87806.  
Last modified: 6 Dec 2017, 3:32:32 UTC

We will see how it goes.

I have gotten rather poor performance with the Ryzen 1700, somewhat less output per core than an i7-3770, and three errors. But I have now disabled SMT in the BIOS. There were some problems with that early on with Ryzen, and maybe Rosetta does not work well with it on AMD. So I am now running Rosetta on 7 full cores, with one core reserved for the GPU. I will run it for about two or three more days to see.
ID: 87833 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile robertmiles

Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,994
Message 87863 - Posted: 8 Dec 2017, 20:11:15 UTC - in response to Message 87813.  

If you have an invalid pointer in your software, it's your problem, not a CPU problem.
I'm missing the word "because" in that sentence.

I just saw a similar problem but under Windows 10 and on an Intel CPU.

7H2LD3_51C703_fold_and_dock_SAVE_ALL_OUT_538615_1685
https://boinc.bakerlab.org/workunit.php?wuid=864346673

Rosetta Mini 3.78

64-bit Windows 10
Intel i7-5950X, 32 GB, SSD

Perhaps someone could check if it's the same problem, but under conditions much less likely to have the problem become visible.
ID: 87863 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 87874 - Posted: 9 Dec 2017, 20:17:14 UTC - in response to Message 87833.  
Last modified: 9 Dec 2017, 21:06:40 UTC

I have gotten rather poor performance with the Ryzen 1700, somewhat less output per core than an i7-3770, and three errors. But I have now disabled SMT in the BIOS. There were some problems with that early on with Ryzen, and maybe Rosetta does not work well with it on AMD. So I am now running Rosetta on 7 full cores, with one core reserved for the GPU. I will run it for about two or three more days to see.

After disabling SMT in the BIOS on my Ryzen 1700 machine (Ubuntu 17.10), I have obtained the following results, which are slightly complicated:
https://boinc.bakerlab.org/results.php?hostid=3299745

Good News: No more errors, with 31 work units being completed successfully. This compares with 3 errors out of 21 work units when SMT was enabled.
Bad News: The output, as measured by the credits, is still quite low even on full Ryzen cores (running 7 cores, with the other one dedicated to a GPU) when you are running only Rosetta (but see below).

And the credits are all over the place. Just considering the Rosetta mini 3.78 that ran the full 24 hours, they range from 178 to 815 (except for the last, at 1160 points), and averaged 337 points. That seems to be about the same (per core) as with SMT enabled and running Rosetta on 15 cores, so enabling SMT should at least increase the total output, even with errors.

However, in neither case is the Ryzen as good as the i7-3770 (with hyperthreading). There I get no errors on 3.78, and credits average around 800 points per work unit running with 7 cores. I see no advantage to Ryzen thus far as compared to Ivy Bridge if you run only Rosetta.

But the Ryzen 1700 does much better on WCG (running mainly MCM and MIP, with a few of the others). There I get no errors, and twice the output of the i7-3770. So there is something wrong with how the Rosetta AMD app runs on Ryzen. I hope they can fix it, as I will probably be converting most of my machines to AMD eventually.

And, in another twist, the last of the Rosettas did quite well at 1160 points. That was because as I was finishing the Rosettas, I allowed the WCG work units to run. Therefore, when most of the cores were running WCG, the last Rosetta got very good points (though the very last of the 3.78 got stuck and I had to abort it).

Moral: Until they fix Rosetta to run properly on Ryzen, it would be best to mix Rosetta with something else on the majority of the cores (WCG works). You will probably need to experiment to find out what works best though.

=====================================================================================================
Work units that ran the full 24 hours (3.78 only) run with SMT disabled (running on 7 full cores):

Returned 9 Dec:
1160.19
 815.46
 187.98 	
 186.55 	

Returned 8 Dec:
 815.21
 178.20
 184.03
 182.54 	
 747.96
 796.89

Returned 7 Dec:
 184.87
 187.17 	
 186.49
 182.87 
 183.08
 183.50 	
 181.75

Ave: 337 points (excluding the last work unit at 1160 points).

NOTE: very little difference in credits per core with SMT enabled (but twice the number of cores).
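Editor's sketch: the figures above can be checked with a few lines of Python. The list holds the credits exactly as posted, leaving out the 1160.19 work unit as the post does. The 500-point cutoff used to split the results into two groups is an arbitrary assumption for illustration, not something the poster used.

```python
from statistics import mean

# Credits as listed above, excluding the 1160.19 work unit.
credits = [
    815.46, 187.98, 186.55,                          # returned 9 Dec
    815.21, 178.20, 184.03, 182.54, 747.96, 796.89,  # returned 8 Dec
    184.87, 187.17, 186.49, 182.87, 183.08, 183.50, 181.75,  # returned 7 Dec
]

THRESHOLD = 500  # assumed cutoff between the two visible groups

low = [c for c in credits if c < THRESHOLD]
high = [c for c in credits if c >= THRESHOLD]

if __name__ == "__main__":
    print("work units:", len(credits))
    print("overall mean:", round(mean(credits)))
    print("low group:", len(low), "units, mean", round(mean(low)))
    print("high group:", len(high), "units, mean", round(mean(high)))
```

The overall mean rounds to the 337 quoted in the post. The split also shows that the average hides two quite separate groups: twelve work units near 184 points and four near 794.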


Addendum: I don't know how 4.06 Rosetta runs on Ryzen, except that the points are lower as compared to 3.78 Rosetta mini. But how it runs on an Intel chip is another matter.
ID: 87874 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 87879 - Posted: 10 Dec 2017, 19:16:38 UTC - in response to Message 87874.  

Error after 5 hours.... 958310977
-529697949 (0xE06D7363) Unknown error code
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x740B08B2

Engaging BOINC Windows Runtime Debugger...

ID: 87879 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile planetclown

Joined: 27 Jan 12
Posts: 5
Credit: 13,101,172
RAC: 8,983
Message 87976 - Posted: 30 Dec 2017, 17:51:00 UTC - in response to Message 87803.  
Last modified: 30 Dec 2017, 17:52:56 UTC

Ryzen is crappy? Are you a troll?
Yes. No. You don't seem to own a Ryzen. I do.

Let me give some brief information about my current computers. One Ryzen 7 1700, right now showing 516 valid tasks and 60 errors. And one FX-8320E, 67 valid and 1 error. I can assure you the Ryzen behaves exactly as planetclown describes. Either the application crashes outright with a segmentation fault, or the C library kills it because it detected an invalid pointer, this way preventing a possible segfault. If you think about it there must also be cases where an invalid pointer goes unnoticed but doesn't cause a segfault. The result could be anything. I wouldn't rely on a Ryzen for something important, let's hope this project's validator is good. If you dig through the project's host list you'll find more Ryzens showing these symptoms, the most obvious running Linux, but also some Windows hosts with a high number of access violations that could be related.

Also as planetclown describes, the errors don't seem to happen with the new Rosetta application and not at other projects, so you could be tempted to dismiss this as an application error in Rosetta Mini. But there's at least one other example of spontaneous segfaults on Ryzens. Search for "kill_ryzen" or "marginality error" and you'll find many reports on Ryzens segfaulting in a particular use case: massive parallel compiler runs on Linux. An extreme scenario, but not unrealistic, and there's no excuse for simply crashing. People there claim you're safe if you don't do that kind of thing, but without arguments, and Rosetta proves them wrong.

So there's at least two completely unrelated cases of several Ryzens segfaulting out of the blue and no valid reason to assume that's all. In other words, those things can unpredictably crash for unknown reasons and if they don't crash you still can't trust the results. Crap.

Just want to reply with an updated status on my SEGFAULT issues with the Ryzen 7 1700. I was able to reproduce the segmentation faults using the “kill ryzen” test. I also got a replacement Ryzen through AMD’s RMA process. It took about a week from when I mailed it back to when I received the replacement.

My original CPU had a manufacture date in the 21st week of the year, the replacement in the 39th week (it’s believed that chips produced in the 25th week or earlier may have issues). I have now completed 97 Rosetta Mini v3.78 tasks on Linux without a single error. It appears that RMAing the Ryzen was the solution. Thank you floyd for providing information on the issues that people have been having with the Ryzen chips!

Results from my desktop with the latest Ryzen:
https://boinc.bakerlab.org/results.php?hostid=3297625&offset=0&show_names=0&state=0&appid=4
ID: 87976 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 87977 - Posted: 30 Dec 2017, 19:09:24 UTC

963219454

ERROR: get_jump_that_builds_residue: not build by a jump!
ERROR:: Exit from: ..\..\..\src\core\kinematics\FoldTree.cc line: 394
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

ID: 87977 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 88006 - Posted: 3 Jan 2018, 13:02:23 UTC - in response to Message 87800.  
Last modified: 3 Jan 2018, 13:02:34 UTC

Sorry to say that but your crappy Ryzen is the problem.


Oh, boys, it's not a bug, it's a feature. :-P
Intel security patch
ID: 88006 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 26,390,338
RAC: 19,707
Message 88042 - Posted: 9 Jan 2018, 18:21:07 UTC
Last modified: 9 Jan 2018, 18:25:28 UTC

Looks like something is wrong with the rb_01_08_.... series of WUs on minirosetta 3.78. (rb_01_08_77806_122534__t000__2_C1_SAVE_ALL_OUT_IGNORE_THE_REST_541301_331_0 is the latest example.)

I have seen some of these tasks consuming a huge amount of RAM: they start in the standard 200-400 MB range but at some point can hoard up to 1400-1800 MB per task. Maybe even more, since it crashed after running out of RAM (8 GB RAM + 4 GB page/swap file on a 6-core CPU).
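Editor's sketch: a rough worked example of why those per-task figures matter on the machine described. It assumes one Rosetta task per core (6 tasks), which the post does not state, and it ignores memory used by the operating system and other programs.

```python
MB_PER_GB = 1024


def peak_demand_mb(tasks, mb_per_task):
    """Total memory demand if every task sits at the given size at once."""
    return tasks * mb_per_task


TASKS = 6                                  # assumption: one task per core
RAM_MB = 8 * MB_PER_GB                     # 8 GB RAM, from the post
RAM_PLUS_SWAP_MB = (8 + 4) * MB_PER_GB     # plus 4 GB page/swap file

if __name__ == "__main__":
    for per_task in (400, 1400, 1800):     # per-task sizes quoted in the post
        demand = peak_demand_mb(TASKS, per_task)
        print(
            f"{per_task} MB/task -> {demand} MB total;",
            "exceeds RAM:", demand > RAM_MB,
            "| exceeds RAM+swap:", demand > RAM_PLUS_SWAP_MB,
        )
```

Under these assumptions, six tasks at 400 MB fit easily, while six tasks at 1400 MB or more already exceed the physical RAM and leave little of the swap file free for everything else.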
ID: 88042 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88043 - Posted: 9 Jan 2018, 19:38:06 UTC - in response to Message 88042.  
Last modified: 9 Jan 2018, 20:03:21 UTC

I have seen some of these tasks consuming a huge amount of RAM: they start in the standard 200-400 MB range but at some point can hoard up to 1400-1800 MB per task.

I have five on Windows 7 64-bit (i7-4771), and six on Ubuntu 16.04 (i7-3770) ranging from 1 to 19 hours with no problems yet, but I will keep an eye on them. If they blow up, it must be late in the run.

EDIT: By the way, I see you are using AMD CPUs. I got poor performance on my Ryzen 1700 on Rosetta, as I reported earlier. I wonder if they need to recompile it to fix this problem too?
ID: 88043 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Joined: 11 Feb 08
Posts: 2136
Credit: 41,518,559
RAC: 15,775
Message 88084 - Posted: 17 Jan 2018, 3:57:06 UTC

Boinc 7.83 recent Mini-rosetta 3.78 error
nRoCM_01_P05055_group0_congq_SAVE_ALL_OUT_IGNORE_THE_REST_541727_1334_0
ERROR: ERROR: reading of AtomPair failed.

ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 559
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

ID: 88084 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 26,390,338
RAC: 19,707
Message 88088 - Posted: 17 Jan 2018, 7:44:06 UTC - in response to Message 88043.  
Last modified: 17 Jan 2018, 7:45:14 UTC

I have not seen such memory leaks lately either.

About AMD CPU performance: I do not know. I do not have any of the latest AMD CPUs (from the Ryzen family) yet.
I am still using older CPUs: one Phenom II X6 and two FX-8320 (Vishera/Piledriver). I have not seen any performance issues with these older AMD CPUs in Rosetta: they are almost on par with the corresponding Intel CPUs (same generation/age and same core count).
ID: 88088 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
James W

Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 88130 - Posted: 20 Jan 2018, 22:48:04 UTC

I've recently begun having this issue on my Windows XP host with a Pentium 4 CPU. Previously there were no problems, though of course it was slow and earned relatively low credits, as expected. Using app 3.78 windows_intelx86. Workunit 872559942 - Task 967645181

01/20/2018 12:57:30 PM | Rosetta@home | Computation for task rb_01_17_79431_122764__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_542014_553_0 finished
01/20/2018 12:57:30 PM | Rosetta@home | Output file rb_01_17_79431_122764__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_542014_553_0_r1092951988_0 for task rb_01_17_79431_122764__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_542014_553_0 absent
01/20/2018 12:57:51 PM | | Suspending computation - CPU is busy
01/20/2018 12:58:01 PM | | Resuming computation
01/20/2018 12:58:17 PM | Rosetta@home | Sending scheduler request: To report completed tasks.
01/20/2018 12:58:17 PM | Rosetta@home | Reporting 1 completed tasks
01/20/2018 12:58:17 PM | Rosetta@home | Not requesting tasks: don't need (job cache full)
01/20/2018 12:58:26 PM | Rosetta@home | Scheduler request completed

Exit status -1073741819 (0xC0000005) STATUS_ACCESS_VIOLATION
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0121B939 read attempt to address 0x39D5626C

Same errors for Workunit 872559856 Task 967645069.
There is no point in me continuing to run Rosetta on this host if this situation continues, as it is able to run SETI@home without issue.
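Editor's sketch: the two Windows failures quoted in this thread show the exit status as a negative decimal number followed by a hexadecimal code. They are the same 32-bit value read as signed and unsigned. The Python below does that conversion and labels the two codes seen here; the lookup table is a small illustration covering only those two values, not a complete list.

```python
def to_unsigned32(status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


# Only the two codes quoted in this thread.
KNOWN = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xE06D7363: "Microsoft C++ exception (low bytes are ASCII 'msc')",
}


def describe(status):
    code = to_unsigned32(status)
    return f"0x{code:08X} {KNOWN.get(code, 'not in this table')}"


if __name__ == "__main__":
    for status in (-1073741819, -529697949):
        print(status, "->", describe(status))
```

The first value is the access violation from this post; the second is the code attached to the "Out Of Memory (C++ Exception)" report earlier in the thread.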
ID: 88130 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote



©2024 University of Washington
https://www.bakerlab.org