Message boards : Number crunching : Cannot retrieve new work
Author | Message |
---|---|
Brian Priebe Send message Joined: 27 Nov 09 Posts: 16 Credit: 33,020,247 RAC: 0 |
BOINC event log reports this afternoon: "03-Aug-2014 15:08:34 | rosetta@home | Server can't open database". |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
There has been a massive surge in new users recently (see the graphs at BOINC stats) and the servers are struggling to keep up with demand. Things should settle down once the surge slows and the new users have downloaded the core database files. |
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
Hi Brian, are you still seeing the error? I checked, and we still have workunits in the queue (so you should be getting some). |
Polian Send message Joined: 21 Sep 05 Posts: 152 Credit: 10,141,266 RAC: 0 |
|
Brian Priebe Send message Joined: 27 Nov 09 Posts: 16 Credit: 33,020,247 RAC: 0 |
"are you still seeing the error? I checked, we still have workunits in queue (so you should be getting some)." My machines are getting new work again. |
Charles Dennett Send message Joined: 27 Sep 05 Posts: 102 Credit: 2,081,006 RAC: 517 |
I'm seeing a slightly different problem. From my BOINC logs: `06-Aug-2014 04:30:57 [rosetta@home] Sending scheduler request: To fetch work.` `06-Aug-2014 04:30:57 [rosetta@home] Requesting new tasks for CPU` `06-Aug-2014 04:31:00 [rosetta@home] Scheduler request completed: got 0 new tasks` `06-Aug-2014 04:31:00 [rosetta@home] No work sent` This is happening more often than not on all 4 of my systems. They try to get new work but get nothing. I also crunch for WCG and POEM (on my one 64-bit Linux box; all my systems run Linux), so I have work, just not Rosetta. Once in a while I'll get a Rosetta workunit, but not often. -Charlie |
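To put a number on "more often than not", the log lines quoted above can be tallied. This is only a rough sketch: it assumes the event log has been saved as plain text with one entry per line in the same layout as the excerpt, and the function name `empty_fetch_ratio` is made up for illustration, not part of BOINC.

```python
import re

# Matches lines such as:
#   06-Aug-2014 04:31:00 [rosetta@home] Scheduler request completed: got 0 new tasks
# The layout is taken from the log excerpt in the post; other client versions may differ.
TASKS_RE = re.compile(
    r"\[(?P<project>[^\]]+)\] Scheduler request completed: got (?P<n>\d+) new tasks"
)

def empty_fetch_ratio(log_lines, project="rosetta@home"):
    """Return the fraction of completed scheduler requests that got 0 tasks.

    Returns None when no completed request for the project is found.
    """
    total = 0
    empty = 0
    for line in log_lines:
        match = TASKS_RE.search(line)
        if match and match.group("project") == project:
            total += 1
            if int(match.group("n")) == 0:
                empty += 1
    return empty / total if total else None

if __name__ == "__main__":
    sample = [
        "06-Aug-2014 04:30:57 [rosetta@home] Sending scheduler request: To fetch work.",
        "06-Aug-2014 04:30:57 [rosetta@home] Requesting new tasks for CPU",
        "06-Aug-2014 04:31:00 [rosetta@home] Scheduler request completed: got 0 new tasks",
        "06-Aug-2014 04:31:00 [rosetta@home] No work sent",
    ]
    print(empty_fetch_ratio(sample))
```

Run against a full day of log lines, a ratio above 0.5 would match the "more often than not" description.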
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Right, the available work seems to be getting consumed about as quickly as it is being generated. The project is still adjusting to all of the new hosts that have arrived at once. Which is a great problem to have! But I've seen on the server status page that the actual number of tasks ready to send has been swinging rapidly as new work is generated and then assigned to hungry hosts. The BOINC Manager will retry for work and pull some down when work units are available. Rosetta Moderator: Mod.Sense |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 1 |
lol! Over 1.1 million tasks in progress. I cannot get any new tasks for my hosts. |
Polian Send message Joined: 21 Sep 05 Posts: 152 Credit: 10,141,266 RAC: 0 |
|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"lol! over 1.1 million tasks in progress. I can not get any new task for my hosts." 2 million in progress now. I just hope most of them come back completed rather than hitting the 10-day expiration. Rosetta Moderator: Mod.Sense |
Charles Dennett Send message Joined: 27 Sep 05 Posts: 102 Credit: 2,081,006 RAC: 517 |
"lol! over 1.1 million tasks in progress. I can not get any new task for my hosts." Curious as to how you get those numbers. (Of course, you're on the inside so may have access to better info than we mere mortals do :-) From the home page, in the upper right corner, I see this: "Server Status as of 7 Aug 2014 16:07:41 UTC [ Scheduler running ]; Total queued jobs: 378,664; In progress: 916,437". Then from the server status page, I see this: "As of 7 Aug 2014 17:31:52 UTC; State / Approximate #results: Ready to send 15,330; In progress 691,956". Are the "Total queued jobs" and "Ready to send" figures supposed to be the same? I realize these numbers are generated at different times, but I would expect them to be close. Thanks for any insight. -Charlie |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 1 |
Values on the Server Status page sometimes change dramatically from one update to the next, so I am not sure whether it is providing real status data. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, it's been incredible! I'm referring to the server status page linked from the homepage. There were 2 million in progress when I posted, now there are 1,760,000, and at some point in between there were about 1 million. So the progression was: 2 million in progress; then over a million reported back as completed (actually a period where a million more were reported back than were assigned); then you caught it with about 1 million in progress. Now we're back to over 1.7m, so over 700,000 more were assigned out to hosts than were reported back. There have been about 100,000 new hosts added this past week. At this point they are all considered "active". Previously the project was running about 60,000 active hosts returning results recently. So even if those new hosts just work on a single task at a time, if they run for the default 3hr runtime, that would be 800,000 tasks per day (24 h / 3 h = 8 tasks per host per day) just on the new hosts. Then you multiply by some average number of CPUs and resource share per host and it's a dramatic whole lotta work getting done! Try to keep in mind that the servers are keeping up fairly well, and that the scale of the project has more than doubled in less than a week. That is a tough feat. So, there will be some growing pains. There will be points in time where all of the WUs have already been assigned to other hosts. Even when work is queued up, it takes the server time to generate BOINC WUs out of it so they can be assigned. The underlying databases, networks and file systems are all seeing a dramatic change in workload as well. It will take some time to find and resolve bottlenecks that did not previously exist. Rosetta Moderator: Mod.Sense |
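The back-of-the-envelope figure in the post above can be reproduced with a few lines. This is a sketch of the arithmetic only: the function name and the `concurrent_tasks` and `hours_per_day` parameters are illustrative assumptions, and the 800,000 figure holds only if every new host runs around the clock.

```python
def tasks_per_day(hosts, runtime_hours, concurrent_tasks=1, hours_per_day=24.0):
    """Estimate how many tasks a pool of hosts completes per day.

    hosts:            number of hosts crunching
    runtime_hours:    wall-clock hours one task takes (3 h default in the post)
    concurrent_tasks: tasks each host runs at once (assumed 1 in the post)
    hours_per_day:    hours per day each host actually runs BOINC (assumption)
    """
    return hosts * concurrent_tasks * hours_per_day / runtime_hours

if __name__ == "__main__":
    # 100,000 new hosts, one task at a time, 3 h per task, running 24 h a day.
    print(tasks_per_day(100_000, 3.0))  # 800000.0
```

Lowering `hours_per_day` shows how sensitive the estimate is to how long each machine is actually switched on.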
Sid Celery Send message Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
At such an early stage there are bound to be many questions, but I found at least one answer: Aug 07, 2014 Predictor of the day: Congratulations to ce223411 for predicting the lowest energy structure for workunit gr071414_2h5_2h5_697_fold_SAVE_ALL_OUT_175228_0 ! |
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
We've encouraged the scientists to submit more jobs. Hopefully that helps. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
"There have been about 100,000 new hosts added this past week. At this point they are all considered 'active'. Previously the project was running about 60,000 active hosts returning results recently. So even if those hosts just work on a single task at a time, if they run for the default 3hr runtime, that would be 800,000 tasks per day just on the new hosts. Then you multiply by some average number of CPUs and resource share per host and it's a dramatic whole lotta work getting done!" Another thing that crossed my mind is that when a new host arrives, the tasks it gets are a bit flaky until a pattern establishes itself for how much work gets done. It's quite probable that they've pulled down a mass of new tasks, some of which will be returned speedily, while others may even get timed out due to inactivity. On that basis I'm expecting it to be a further 10 days before things settle down. I also note that new users are still being added rapidly, albeit more slowly each day. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I checked a few hosts at random over the past few days. The new ones typically pulled down one or two tasks, and if they returned them they got 3 or 4 more. One of the things BOINC has to try to get a handle on early is how many hours per day the computer is likely to be running BOINC. There were many hosts I saw that were added on the third which still had not returned their first completed WU. Either the machine hasn't been running, there wasn't network access, the server was choked at the time they tried to report, or they've turned off the charity engine without aborting their current task. But for the most part it looked like new hosts were beginning to return results and settle into a regular workflow. It also appeared to me that most new hosts were not running more than a few hours per day. Otherwise they would have completed more work over the course of several days. It appears that while there are over 6 million tasks in the queue, the dedicated tasks that churn those into BOINC WUs are having trouble keeping ahead of the demand for tasks. There were over 2 million tasks in progress earlier in the day. Now it shows half that. Yet there were still only 426,000 successfully completed tasks in the past 24 hrs. If those figures are accurate and consistent snapshots in time, it implies that about half of the reported WUs were not successes. Rosetta Moderator: Mod.Sense |
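The inference at the end of the post above can be checked numerically. This is a sketch under the same caveat the post gives (the snapshots must be accurate and consistent); the function name is hypothetical, and the calculation ignores tasks newly assigned during the window, which would only push the unsuccessful share higher.

```python
def min_unsuccessful_fraction(in_progress_before, in_progress_after, successes):
    """Lower bound on the share of reported tasks that were not successes.

    The net drop in the in-progress count is the minimum number of tasks
    that left that state (reported back, errored, or expired). Anything
    beyond the recorded successes must have ended some other way.
    """
    left_in_progress = in_progress_before - in_progress_after
    if left_in_progress <= 0:
        raise ValueError("in-progress count did not drop; nothing to infer")
    return max(0.0, 1.0 - successes / left_in_progress)

if __name__ == "__main__":
    # Figures from the post: ~2 million in progress, later ~1 million,
    # with 426,000 successful completions in the past 24 hours.
    share = min_unsuccessful_fraction(2_000_000, 1_000_000, 426_000)
    print(f"{share:.1%}")  # 57.4%
```

With the rounded figures from the post, the bound comes out a little above the "half" quoted there.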
Sid Celery Send message Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
"I checked a few hosts at random over the past few days. The new ones typically pulled down one or two tasks, and if they returned them they got 3 or 4 more. One of the things BOINC has to try and get a handle on early on is how many hours per day the computer is likely to be running BOINC. There were many hosts I saw that were added on the third which still had not returned their first completed WU. Either the machine hasn't been running, there wasn't network access, the server was choked at the time they tried to report, or they've turned off the charity engine without aborting their current task. But, for the most part it looked like new hosts were beginning to return results and begin to settle in to a regular workflow. It also appeared to me that most new hosts were not running more than a few hours per day. Otherwise they would have completed more work over the course of several days." That's where I got the 10 days from: failure to meet deadlines, and reissue to (hopefully) active crunchers. |
©2024 University of Washington
https://www.bakerlab.org