Current issues with 7+ boinc client

mmstick
Joined: 4 Dec 12
Posts: 8
Credit: 606,792
RAC: 0
Message 74763 - Posted: 20 Dec 2012, 16:36:00 UTC - in response to Message 74755.  
Last modified: 20 Dec 2012, 16:42:34 UTC

I had no idea this was a problem. I've been crunching with my Radeon HD 7950 in World Community Grid and POEM@Home while doing Rosetta@home tasks and never had a single problem with invalidated or errored work units. I'm using BOINC v7 as well.


Umm not exactly:
Task ID    Work unit ID  Sent                     Time reported            Server state  Outcome         Client state   CPU time (sec)  Claimed credit  Granted credit
549205051  499214705     9 Dec 2012 22:07:32 UTC  11 Dec 2012 7:06:39 UTC  Over          Client error    Compute error  11,468.76       79.64           ---
549203504  499213284     9 Dec 2012 21:58:59 UTC  9 Dec 2012 23:27:03 UTC  Over          Validate error  Done           580.03          ---             ---
549209311  499218156     9 Dec 2012 22:45:08 UTC  14 Dec 2012 5:32:01 UTC  Over          Validate error  Done           215.59          ---             ---

And then a ton of units 'aborted by user'. I went as far back as the stats I can see and you only had one valid unit that you got credits for. You may have had nothing but success prior to what I can see, I have no idea, but you did have some problems too.
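For anyone who wants to tally outcomes from a pasted results list instead of eyeballing it, here is a minimal Python sketch. It assumes the rows follow the usual BOINC results-page column order (task ID, work unit ID, sent, reported, server state, outcome, client state, CPU time, claimed credit, granted credit). The function names and the regular expression are illustrative, not part of any BOINC tool.

```python
import re

# Rows copied from the post above. The column meaning is an assumption
# based on the usual BOINC results-page layout.
ROWS = """\
549205051 499214705 9 Dec 2012 22:07:32 UTC 11 Dec 2012 7:06:39 UTC Over Client error Compute error 11,468.76 79.64 ---
549203504 499213284 9 Dec 2012 21:58:59 UTC 9 Dec 2012 23:27:03 UTC Over Validate error Done 580.03 --- ---
549209311 499218156 9 Dec 2012 22:45:08 UTC 14 Dec 2012 5:32:01 UTC Over Validate error Done 215.59 --- ---
"""

# A timestamp such as "9 Dec 2012 22:07:32 UTC".
STAMP = r"\d{1,2} \w{3} \d{4} \d{1,2}:\d{2}:\d{2} UTC"

ROW_RE = re.compile(
    rf"(?P<task>\d+) (?P<wu>\d+) (?P<sent>{STAMP}) (?P<reported>{STAMP}) "
    r"(?P<server_state>\w+) (?P<outcome>Client error|Validate error|Success) "
    r"(?P<client_state>.+?) (?P<cpu>[\d,.]+) (?P<claimed>\S+) (?P<granted>\S+)$"
)


def parse_rows(text):
    """Turn pasted result rows into dicts; lines that do not match are skipped."""
    records = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            record = match.groupdict()
            # "11,468.76" -> 11468.76
            record["cpu"] = float(record["cpu"].replace(",", ""))
            records.append(record)
    return records


def outcome_counts(records):
    """Count how many tasks ended with each outcome."""
    counts = {}
    for record in records:
        counts[record["outcome"]] = counts.get(record["outcome"], 0) + 1
    return counts


if __name__ == "__main__":
    print(outcome_counts(parse_rows(ROWS)))
```

Only the three outcome labels that could plausibly appear here are listed in the pattern; any other label would need to be added to it.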

I still think the problem is based around the GPU and its drivers. Chilean has two things that are contradictory there. His list says:
Thu 13 Dec 2012 07:17:00 EST | | No usable GPUs found
but further down he says "Yet my NVIDIA card (which is running GPUGRID)", so either they are not from the same PC or there IS a problem someplace!
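One way to check whether a given log really reports a missing GPU is to split each event-log line on its `|` separators. The sketch below assumes the three-field layout seen in the quoted line (timestamp, project, message). The helper names are hypothetical.

```python
def parse_event_line(line):
    """Split one event-log line into timestamp, project and message.

    Assumes the "time | project | message" layout of the line quoted above.
    An empty project field (as in that line) is returned as None.
    """
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return {"time": stamp, "project": project or None, "message": message}


def reports_no_gpu(lines):
    """True if any well-formed line carries the 'No usable GPUs found' message."""
    return any(
        parse_event_line(line)["message"] == "No usable GPUs found"
        for line in lines
        if line.count("|") >= 2
    )


if __name__ == "__main__":
    sample = "Thu 13 Dec 2012 07:17:00 EST | | No usable GPUs found"
    print(parse_event_line(sample))
```

Running this over the logs of each machine separately would show whether the "no GPU" message and the GPUGRID work come from the same host.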


Wrong, don't try to look at my stuff; I don't run this project on anything but an old laptop. I aborted all tasks on my desktop about a week ago, after I switched completely to World Community Grid, because it demands my entire CPU to keep my GPU fed (note my RAC: a high-end desktop CPU would not have a low RAC). The compute error was caused by restarting the client abruptly. All work units passed successfully until I switched.

I only ran this project for one day on my high-end desktop (FX-8120 @ 4 GHz + HD 7950 in POEM/WCG). Not a single task failed.

This has nothing to do with AMD GPUs as far as I am concerned, nor do I see why it would be involved with NVIDIA GPUs.

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 74764 - Posted: 21 Dec 2012, 6:35:28 UTC - in response to Message 74763.  

There are many different system configurations. Yours apparently isn't affected by the bug, but most, if not ALL, of the systems suffering from this bug have a GPU involved.
It shouldn't really matter anyway; it's a Rosetta@home problem, not a cruncher problem.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74765 - Posted: 21 Dec 2012, 12:01:13 UTC - in response to Message 74762.  

I think I've said this before, and I'm going to say it again. :)

Since the problem seems to be with the GPU, wouldn't it be an idea to test a version that doesn't use the graphics or prepare data for the GPU? It might help.


I thought they had said they could not use the GPU because of the way the processing is done?

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74766 - Posted: 21 Dec 2012, 12:05:17 UTC - in response to Message 74763.  

So you ran POEM, WCG and Rosetta at the same time and had no problems? I have a large handful of PCs and can run Rosetta ONLY on those that are NOT also running GPU projects. Any time I try to run Rosetta units on a PC that is ALSO crunching with the GPU, the Rosetta units fail, no matter whether it is an AMD or an Nvidia GPU in the machine.

mmstick
Joined: 4 Dec 12
Posts: 8
Credit: 606,792
RAC: 0
Message 74771 - Posted: 22 Dec 2012, 8:42:32 UTC

Yes, I was running POEM and WCG GPU projects and Rosetta@home at the same time. Not a single error.

microchip
Joined: 10 Nov 10
Posts: 10
Credit: 2,269,015
RAC: 3,068
Message 74772 - Posted: 22 Dec 2012, 11:07:54 UTC - in response to Message 74771.  

Same here on Linux with BOINC 6.12.34. Everything runs fine even with GPU crunching

Team Belgium

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74773 - Posted: 22 Dec 2012, 12:57:55 UTC - in response to Message 74772.  

I guess we can now see why the issue hasn't been fixed yet: it is only seen by some people, not everyone! That is like taking your car to the mechanic because it is making a noise, and when you get there it stops!

Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 15,074
Message 74777 - Posted: 24 Dec 2012, 16:44:43 UTC

Another fresh example of a 100% error rate at R@H:
https://boinc.bakerlab.org/rosetta/results.php?hostid=1582894&offset=20

A Kepler GPU again (this time the mobile version in a notebook). The errors started on 21.12.12 after a video driver update.
With driver version 306.97, R@H ran relatively normally.
After the update to NVIDIA driver 310.70, there is a 100% error rate at the validation stage.

JAMES DORISIO
Joined: 25 Dec 05
Posts: 15
Credit: 201,201,447
RAC: 48,100
Message 74779 - Posted: 24 Dec 2012, 18:59:09 UTC
Last modified: 24 Dec 2012, 19:23:39 UTC

I just upgraded another computer to Ubuntu Linux 12.04 amd64, NVIDIA driver 310.14, and BOINC 7.0.27, all downloaded from the Ubuntu repository. I have the run time preference set at 12 hours.

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1485068

This computer was previously running Ubuntu 10.04 amd64, NVIDIA driver 304.**, and BOINC 6.10.17.
All hardware remained the same; it was and still is running GPU work from GPUGRID and WCG on a GTS 450. The only change was the upgrade to Ubuntu 12.04, along with the new versions of BOINC and the NVIDIA drivers that came with it. Before the upgrade it ran with no errors; after the upgrade it has produced 3 errors out of 3 work units. I have stopped new tasks from Rosetta@home for now. It is successfully completing Human Proteome Folding Phase 2 work units from WCG, which use Rosetta software.

The computer below is also affected by this bug. See message 74598.

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1579123

I have 3 more computers to upgrade, but it looks like they will not be able to run here if I do. For now I will hold off. It would be nice if someone from the Rosetta staff could post here so we know they are looking into this.
Thanks, Jim

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74781 - Posted: 25 Dec 2012, 12:06:44 UTC - in response to Message 74779.  

The guy that started this thread is a 'Rosetta guy', but he hasn't come back very much!!
MERRY CHRISTMAS!!

Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 15,074
Message 74786 - Posted: 26 Dec 2012, 2:06:39 UTC - in response to Message 74777.  
Last modified: 26 Dec 2012, 2:07:50 UTC

If I turn off the Kepler GPU in the notebook BIOS (switching to integrated graphics), the R@H errors are gone. Turn it back on and all WUs fail after validation.
Reverting to the old (v. 306.97) video driver makes the WUs finish OK again.

So now the source of the problem is clear: some sort of conflict between R@H and the latest NVIDIA video drivers (and the drivers must be active, not just installed in the system).
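To keep the driver reports in this thread side by side, here is a small Python sketch that groups them by major driver branch. The data are only the four reports posted so far (two from the notebook above, two from the Ubuntu machine reported earlier); the tuple layout and function names are illustrative assumptions.

```python
# Reports taken from this thread. "304.**" is kept exactly as it was posted;
# None means the BOINC version was not stated for that report.
REPORTS = [
    # (reporter, nvidia_driver, boinc_version, rosetta_ok)
    ("Mad_Max", "306.97", None, True),
    ("Mad_Max", "310.70", None, False),
    ("JAMES DORISIO", "304.**", "6.10.17", True),
    ("JAMES DORISIO", "310.14", "7.0.27", False),
]


def driver_branch(version):
    """Return the major driver branch, e.g. '310.70' -> 310."""
    return int(version.split(".", 1)[0])


def failing_branches(reports):
    """Branches seen only with failures, never with a working setup."""
    ok = {driver_branch(drv) for _, drv, _, good in reports if good}
    bad = {driver_branch(drv) for _, drv, _, good in reports if not good}
    return sorted(bad - ok)


if __name__ == "__main__":
    print(failing_branches(REPORTS))
```

With only four data points this is a bookkeeping aid, not proof. On the Ubuntu machine the OS and the BOINC version changed together with the driver, so that pair cannot separate the three factors.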

tanstaafl9999
Joined: 8 Mar 12
Posts: 2
Credit: 1,688,827
RAC: 0
Message 74787 - Posted: 26 Dec 2012, 2:39:45 UTC
Last modified: 26 Dec 2012, 2:42:02 UTC

Just to add to the confusion:

I stopped doing Rosetta WUs (the 3.45 ones) several weeks ago because of a 100% failure rate. After all the comments here about the possible connection to GPU crunching, I decided to (temporarily) stop doing GPU work units and see if that would let me do Rosetta WUs without problems.

Without any GPU WUs, I successfully completed several Rosetta WUs. So I decided to start doing GPU work units again to see if that would cause the Rosetta WUs to start failing.

After a couple of days running both GPU and Rosetta WUs, I've only had one Rosetta WU fail... (Before, I was getting a 100% failure rate when I was running Rosetta and GPU work units together.)

I'm using the exact same AMD video drivers and I haven't made any hardware changes to my computer recently.

It seems to be working OK at the moment, so I'll keep running GPU and Rosetta work units for the time being and hope it keeps on working.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74788 - Posted: 26 Dec 2012, 12:38:23 UTC - in response to Message 74786.  
Last modified: 26 Dec 2012, 12:40:01 UTC

And it is NOT just Nvidia drivers, though; I have a handful of AMD cards and those machines get nothing but errors too.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74789 - Posted: 26 Dec 2012, 12:39:39 UTC - in response to Message 74787.  

GOOD LUCK and I hope it keeps working for you!

Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 15,074
Message 74791 - Posted: 26 Dec 2012, 19:29:57 UTC - in response to Message 74788.  


Please add one or two examples of the same error on a computer with ATI/AMD cards (and not NV+ATI cards in the same computer) to our collection. I have seen several dozen computers with this problem and all of them had an nVidia card installed. In 3 of them, replacing or turning off the NV card solved the problem (but just stopping crunching on it was not enough).

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74795 - Posted: 27 Dec 2012, 12:36:30 UTC - in response to Message 74791.  


I am crunching for other projects right now so can't do that, but NONE of my machines with AMD cards ALSO have Nvidia cards, so I am the example. On my list of PCs (you have to click on All Computers), only my servers do not have AMD cards in them; they are too old and do not even have PCI-e slots. They have standalone GPUs, but not ones that can crunch. Sorry, I do not see any workunits listed under any of my PCs though.

Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 15,074
Message 74796 - Posted: 27 Dec 2012, 14:26:14 UTC - in response to Message 74795.  

To mikey:

Yes, I checked your computers list before my post. But R@H does not show any information about installed GPUs (due to very old server-side BOINC code), which is why I asked about the NV card. And yes, no workunits are listed because you stopped crunching R@H long ago (at least a few months ago) and all the WUs you completed have already been removed from the database.

So if you have a little time, you could reconnect one or two machines (or allow new work if they are just in NNT mode) that have only an AMD card and where you saw the bug before. Let R@H work for, say, half a day before switching back to your main projects. Or set the target CPU time to 1 hour and run for just a few hours, to minimize the loss of resources.
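For a short test like this, the "no new tasks" switch can also be flipped from a script. The sketch below only builds the `boinccmd` command line for the `allowmorework` and `nomorework` project operations and prints it by default. The project URL shown is an assumption and must match the URL the client is actually attached with; the helper names are hypothetical. The target CPU run time is a project preference and is not covered here.

```python
import subprocess

# Assumed master URL; use whatever URL your client shows for Rosetta@home.
ROSETTA_URL = "https://boinc.bakerlab.org/rosetta/"

ALLOWED_OPS = {"allowmorework", "nomorework", "update"}


def project_cmd(project_url, op, boinccmd="boinccmd"):
    """Build the argv for `boinccmd --project URL OP` without running it."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operation: {op}")
    return [boinccmd, "--project", project_url, op]


def run(argv, dry_run=True):
    """Print-style dry run by default; set dry_run=False to really call boinccmd."""
    if dry_run:
        return " ".join(argv)
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    # Allow new Rosetta work for the test, then stop it again afterwards.
    print(run(project_cmd(ROSETTA_URL, "allowmorework")))
    print(run(project_cmd(ROSETTA_URL, "nomorework")))
```

Keeping the dry run as the default means nothing touches the client until the printed command has been checked.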

I am just curious (and this may also help the project programmers pinpoint the cause of the error) whether it is exactly the same bug on hosts with AMD cards, or something else.
On "problem" computers with NV cards, the symptoms look like this:
1. Not just a very large number of errors: all (100%) WUs fail, with no exceptions.
2. The error appears only after WU validation on the server; there are no visible errors on the client side while the WU runs or in the WU logs.
3. Information about the Rosetta version is missing from the task logs.

+ Physically removing the NV card (or turning it off in the BIOS for laptops) stops these errors. (I am not sure whether in all cases or just in some; too few statistics.)
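The three symptoms above can be written down as a simple check, which may make it easier to compare AMD-only hosts against the NV ones. This is a minimal sketch: the `TaskResult` fields are hypothetical and would have to be filled in by hand from each task's results page and log.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    outcome: str           # e.g. "Success", "Validate error", "Client error"
    client_error: bool     # True if the client itself reported an error
    log_has_version: bool  # True if the task log mentions the Rosetta version


def matches_signature(tasks):
    """True only if every task shows all three symptoms listed above.

    1. all tasks fail (no exceptions),
    2. the failure shows up only at validation, not on the client,
    3. the Rosetta version line is missing from the task log.
    """
    if not tasks:
        return False  # nothing to judge
    return all(
        t.outcome == "Validate error"
        and not t.client_error
        and not t.log_has_version
        for t in tasks
    )


if __name__ == "__main__":
    host = [TaskResult("Validate error", False, False) for _ in range(3)]
    print(matches_signature(host))
```

A host with even one successful task, or with client-side compute errors, would not match, which fits the point that this bug is all-or-nothing.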

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74801 - Posted: 28 Dec 2012, 12:31:33 UTC - in response to Message 74796.  

Give me a couple of days, I am busy with family at the moment AND almost at a mini goal on a cpu project, then I will connect one or two and see what happens.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74808 - Posted: 31 Dec 2012, 15:58:49 UTC

Mad Max, I have added two PCs to Rosetta. The first has NO GPU but is running BOINC version 7.0.40. The second is also running BOINC 7.0.40 but DOES have an AMD GPU in it, and it is crunching for Collatz.

We will see what happens from here!

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 74810 - Posted: 31 Dec 2012, 21:45:29 UTC - in response to Message 74808.  

[update]
BOTH units completed and got credits! I guess I will bring more PCs over.



©2024 University of Washington
https://www.bakerlab.org