Occasional, sudden Computation Errors.

Dirk Mittler
Joined: 16 Nov 09
Posts: 1
Credit: 48,400
RAC: 0
Message 72932 - Posted: 30 Apr 2012, 10:15:22 UTC
Last modified: 30 Apr 2012, 10:32:43 UTC

Hello,

I have a Windows 7 x64-based computer, with 12GB of RAM, and with an Intel i7 quad-core CPU, which is actually organized into "8 virtual cores" (for hyper-threading).

I run "Docking", "Rosetta" and "GPUGRID" on this machine, and currently have BOINC Manager v7.0.25. I never give more than 1 CPU core to any Work Unit.

The BOINC Manager has the ability to suspend GPU-based Work Units independently of suspending all Work Units (i.e. the CPU-based WUs as well).

Even though the Rosetta WUs I host are not GPU-based, there is a reality to GPU computing which might be helpful for you to know, and possibly relevant to this posting here at Rosetta.

When the CPU gives the command to kill the WUs on the GPU, that command cannot always be executed right away. I suspect the main reason is that, while CUDA is good at launching 'asynchronous loops' and 'shader subroutines' on my 300+ GPU cores, CUDA today is still very bad at actually killing those. This is because the GPU is really a separate machine from the CPU that is trying to manage it.

BTW, my graphics card is an nVidia GeForce GTX 460.

Moreover, when we give the command to suspend a GPU WU, we want to be able to save all the work already done on the WU (back to the real computer), so that work can be resumed later, just as with a regular CPU-based WU. That work by default takes up Graphics RAM. I think that in practice, this is just not always doable. Therefore, when the BOINC Manager gives the command to suspend all Work Units, what can often happen is that the GPUGRID WU in progress continues to run for some time before finally suspending.

In order for the software to manage this properly, the programmers at GPUGRID have programmed the part of their WU which does run on the CPU, to ignore Suspend requests deliberately at certain times.
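As a rough illustration of the deferred-suspend behaviour described above, here is a toy Python model. It is only a sketch under assumptions: it is not GPUGRID's actual code, and the class and method names (`SuspendableWorker`, `begin_critical_section`, `request_suspend`, and so on) are hypothetical. It shows a suspend request that arrives during a critical section being remembered and honoured only once that section ends.

```python
import threading


class SuspendableWorker:
    """Toy model of a work unit that postpones suspend requests
    while it is inside a critical section (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_critical = False      # True while suspend must be deferred
        self._suspend_pending = False  # a suspend arrived during the critical section
        self.suspended = False

    def begin_critical_section(self):
        with self._lock:
            self._in_critical = True

    def end_critical_section(self):
        with self._lock:
            self._in_critical = False
            # Honour a suspend request that was deferred earlier.
            if self._suspend_pending:
                self._suspend_pending = False
                self.suspended = True

    def request_suspend(self):
        with self._lock:
            if self._in_critical:
                # Do not suspend now; remember the request for later.
                self._suspend_pending = True
            else:
                self.suspended = True


worker = SuspendableWorker()
worker.begin_critical_section()
worker.request_suspend()
print(worker.suspended)   # False: the request is deferred
worker.end_critical_section()
print(worker.suspended)   # True: the deferred request is applied
```

From the outside, a worker like this looks as if it keeps running for a while after the suspend command, which matches the delay seen with the GPUGRID WU.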

I don't have any issues with this, as it's predictable and allows me to make my calculations accordingly.

But I think that this ~problem~ might specifically be affecting Rosetta WUs.

I think that I've noticed a few times now, that when the BOINC Manager gives a command to suspend GPU WUs in progress, this can actually cause all the running Rosetta WUs suddenly to display a Computation Error, even though some of those may only have had 10 minutes or so of work done.

This will often happen regardless of whether the command to suspend the GPU work is given manually, or whether it is given automatically because my computer is suddenly in use in the foreground (by me).

And I don't really think that this is an error with the BOINC Manager, because it never seems to happen to any Docking WUs.

Also, if there is no GPU work in progress (let's say because I left GPU work manually suspended), and if I then allow the BOINC Manager to suspend all software WUs - i.e. CPU-based WUs in progress - then Rosetta consistently does not seem to suffer from multiple, simultaneous Computation Errors any more.

Thus, if all that is given is a command to suspend CPU WUs in progress, then Rosetta WUs also suspend fine. And resume fine later on.
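For anyone who wants to try reproducing the two cases, the sketch below builds the matching `boinccmd` command lines in Python. It assumes that `boinccmd` is on the PATH and is allowed to talk to the local client, which may not hold on every install; the helper name `suspend_command` and the scenario labels are made up for illustration. By default nothing is executed, the command is only returned.

```python
import subprocess


def suspend_command(scenario):
    """Return the boinccmd argument list for one of the two scenarios.

    "gpu_only" suspends GPU work only; "all" suspends all computing.
    """
    if scenario == "gpu_only":
        return ["boinccmd", "--set_gpu_mode", "never"]
    if scenario == "all":
        return ["boinccmd", "--set_run_mode", "never"]
    raise ValueError("unknown scenario: " + scenario)


def apply_scenario(scenario, dry_run=True):
    """Build the command and, unless dry_run is set, run it."""
    args = suspend_command(scenario)
    if not dry_run:
        # Assumption: boinccmd is installed and can reach the client.
        subprocess.run(args, check=True)
    return args


print(apply_scenario("gpu_only"))
print(apply_scenario("all"))
```

Running the "gpu_only" case while Rosetta WUs are in progress, and then checking whether they show Computation Errors, would be one way to test the pattern described here.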

I suspect that lost WUs mean more to you than they do to me. I only do low-maintenance BOINC computing on the side. I'd like to say that BOINC is my screensaver, but it's not even that. My screensaver is a 3D Text Screensaver, with BOINC running further in the background.

But should the Rosetta program be responding in some inefficient way to the failed attempt to suspend GPU work, then you might want to look into this.

Dirk
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,174,382
RAC: 3,121
Message 73014 - Posted: 7 May 2012, 10:43:12 UTC - in response to Message 72932.  

READ the Client Errors thread, LOTS of people are having the same problems!!


©2024 University of Washington
https://www.bakerlab.org