Servers?

deesy58
Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68001 - Posted: 9 Oct 2010, 19:27:55 UTC

Last night, between 12:30 AM and 01:00 AM PDT, it appeared that all servers were down. I was unable to access even the "Server Status Page."

Now, the Server Status Page shows all servers up and running, but I have 14 completed tasks waiting. Of the 14, 5 failed on "Computation Error[s]," and the others are "Ready to Report." I have two tasks running, and four more "Ready to Start." I am running Windows 7 on an old Pentium D.

Is anybody else seeing this problem? Is this just a temporary outage for maintenance?

deesy

ID: 68001 · Rating: 0

deesy58
Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68017 - Posted: 10 Oct 2010, 17:41:53 UTC

My queue of work units finally emptied out last night at about 11:30 PM PDT, and my machine remained idle until 8:42 AM PDT this morning. Then, minirosetta 2.16 was downloaded, along with 17 new work units. The first nine of these work units aborted with computation errors. Two are currently being processed and six more are waiting to start.

2.16 appears to be exacerbating the computation error problem. I also had computation errors before 2.16 was downloaded to my machine, but I'm not sure if they were from 2.15 or 2.14.

The servers appear to be functioning again, but they are not responding very quickly.

deesy
ID: 68017 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 9,701
Message 68022 - Posted: 11 Oct 2010, 0:33:20 UTC

The whole site was down for about 10 hours on Saturday 9th. When it came back up, my logs show no problem uploading or downloading straight away (sending 6, receiving 8 WUs). I have no idea why your machine didn't dial in for a further 24 hours, but Boinc gets funny sometimes. Your buffer is inadequate, as we discussed before, but apparently you know best so I assume that was intentional. No problems on an unattended machine here because I don't fail to plan. Funny how that keeps on working for me.

On your computation error issues, all those mem_widd tasks show the same failure and Yifan has confirmed the problem is at Rosetta's end - see the 2.16 thread.
ID: 68022 · Rating: 0

deesy58
Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68023 - Posted: 11 Oct 2010, 1:46:39 UTC - in response to Message 68022.  

The whole site was down for about 10 hours on Saturday 9th. When it came back up, my logs show no problem uploading or downloading straight away (sending 6, receiving 8 WUs). I have no idea why your machine didn't dial in for a further 24 hours, but Boinc gets funny sometimes. Your buffer is inadequate, as we discussed before, but apparently you know best so I assume that was intentional. No problems on an unattended machine here because I don't fail to plan. Funny how that keeps on working for me.

On your computation error issues, all those mem_widd tasks show the same failure and Yifan has confirmed the problem is at Rosetta's end - see the 2.16 thread.


My buffer is large enough to hold about 20 tasks. Are you saying that 20 is insufficient?

My machine received no new work for at least 24 hours, even though you say that the site was down for only ten hours. Do you know the algorithm for reconnecting after an outage? Perhaps the servers connect to one user at a time, and completely load their buffers with work before moving on to the next user, instead of giving each contributor two or four tasks so that everybody is able to contribute again as soon as possible. Although such an algorithm would be simpler than ensuring that all contributors have work as quickly as possible, it would be less efficient, wouldn't you agree?

There you go again. Assuming that, just because you have a given experience, therefore everybody must have the same experience is a little simplistic, and it is an example of defective reasoning.

A lot of faulty thinkers believe (mistakenly) that fortune is actually the result of their superior reasoning/planning/business skills. I suppose if you won the lottery you would believe that it was because of your exceptional Math skills.

deesy
ID: 68023 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 9,701
Message 68029 - Posted: 11 Oct 2010, 11:29:26 UTC - in response to Message 68023.  

My buffer is large enough to hold about 20 tasks. Are you saying that 20 is insufficient?

My machine received no new work for at least 24 hours, even though you say that the site was down for only ten hours. Do you know the algorithm for reconnecting after an outage? Perhaps the servers connect to one user at a time, and completely load their buffers with work before moving on to the next user, instead of giving each contributor two or four tasks so that everybody is able to contribute again as soon as possible. Although such an algorithm would be simpler than ensuring that all contributors have work as quickly as possible, it would be less efficient, wouldn't you agree?

# of tasks in your buffer isn't sufficient information. But you know that, so I leave it to you to fill in the blanks.

The rest isn't relevant if Boinc doesn't call for tasks. Before you posted for the first time my unattended machine had connected/ul'd/dl'd 5 times successfully over the previous 3.5 hours and, it seems, did so again 11 more times by the time your machine got its first tasks. How many times in the 14hr period (not more) before you received your first tasks did you get the reply "Internet access OK - project servers may be temporarily down" or "Scheduler request completed: got 0 new tasks"?

The first answer I think will be 'none' because the task server was up when the website returned and you posted after that. The second answer, from what you've said, might be once, but seeing as you received many tasks as soon as your machine reported it may also be 'none'. In which case it's a different problem that may be entirely local.

I can't see where I've said no-one else should have a problem because I didn't. What I'm seeking to clarify is whether Boinc on your machine didn't ul/dl tasks because it didn't actually ask for any. Why that should be, I can't say, but Boinc does get its task requests wrong a lot (something I see regularly, if not often) so it wouldn't surprise me. Again, that would be a local problem.
ID: 68029 · Rating: 0

deesy58
Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68036 - Posted: 11 Oct 2010, 18:47:58 UTC - in response to Message 68029.  


# of tasks in your buffer isn't sufficient information. But you know that, so I leave it to you to fill in the blanks.


My buffer is large enough to hold a two-day supply of work. If Rosetta's servers are out of service for only nine or ten hours, a two-day buffer should be adequate, shouldn't it?

The rest isn't relevant if Boinc doesn't call for tasks. Before you posted for the first time my unattended machine had connected/ul'd/dl'd 5 times successfully over the previous 3.5 hours and, it seems, did so again 11 more times by the time your machine got its first tasks. How many times in the 14hr period (not more) before you received your first tasks did you get the reply "Internet access OK - project servers may be temporarily down" or "Scheduler request completed: got 0 new tasks"?


There you go again. I suppose you believe that if it is raining at your house, it must, necessarily, be raining at everybody else's house, too. How simplistic!


The first answer I think will be 'none' because the task server was up when the website returned and you posted after that. The second answer, from what you've said, might be once, but seeing as you received many tasks as soon as your machine reported it may also be 'none'. In which case it's a different problem that may be entirely local.


You think wrong! My machine attempts repeatedly to connect to Rosetta's servers to acquire new work, and receives repeated error messages. I guess you really don't have a very good understanding of how computers and their software work. Do you really believe that all users can be magically and instantly supplied with adequate work when previously-down servers come back on-line? I suppose so! As Arthur C. Clarke once said: "Any sufficiently advanced technology is indistinguishable from magic."

I can't see where I've said no-one else should have a problem because I didn't. What I'm seeking to clarify is whether Boinc on your machine didn't ul/dl tasks because it didn't actually ask for any. Why that should be, I can't say, but Boinc does get its task requests wrong a lot (something I see regularly, if not often) so it wouldn't surprise me. Again, that would be a local problem.


Do you actually think that there might be multiple versions of BOINC, and that my version might be different from yours? Don't we all run the same basic version of BOINC and Rosetta that are automatically updated when necessary? If you believe that I have somehow acquired a defective version of BOINC, you should say so, and you should tell me (and everybody else) how to correct it.

What do you mean by "a local problem?" Could you be more specific? It appears that you are making unwarranted assumptions, again. Your posts are not particularly helpful, regardless of how knowledgeable you think you might be.

deesy



ID: 68036 · Rating: 0

Jochen
Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 68037 - Posted: 11 Oct 2010, 19:41:27 UTC - in response to Message 68036.  

What do you mean by "a local problem?" Could you be more specific?

50 cm in front of the monitor? ;)

ID: 68037 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68039 - Posted: 11 Oct 2010, 19:54:20 UTC

deesy58, in the future, please do not respond to posts you find not particularly helpful. Often when server outages occur, various users have different observations over time as things recover. One user stating their observations after you have stated yours should not be taken as any contradiction nor expectation on what you observe. If you string together 5 or 6 such factual observations, you can often see progress with time on getting things back to normal. And so it is commonplace to make such posts in threads such as this one.

When facts about errors, and retries are omitted from problem descriptions the reader is left to presume many things; especially in a project where you can specify a preference for tasks to run anywhere from an hour to 24hrs.

If the reader frequents a number of BOINC project message boards, they often fill in missing details with similar problems they are familiar with. There are a number of issues where BOINC's core client is not requesting work from projects even when cores are idle.

BOINC versions do not automatically update themselves, so every client machine can be different. Indeed there are scores of versions possible.

When you have a number of rapid failures in a row, the BOINC core client can get confused about how long to expect tasks to take to complete and has trouble requesting a proper amount of work to match the desired network preferences. It appears a batch of v2.16 tasks that were sent out failed on startup. These have since been removed. When such pervasive problems are encountered, the servers get bottlenecked trying to replace and reissue the failing tasks.

No, project servers do not contact the attached machines when they recover from an outage... in order or otherwise. Project servers never contact the attached machines; the architecture is always client-pull, not server-push. Depending upon how many times your machine tried to contact the project, the delay time until the next request gets increasingly large. That may explain any time gap where your machine did not attempt to contact the project for a few hours (if that occurred).
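
Editor's illustration: the growing delay between client requests described above is commonly implemented as a capped exponential backoff with some randomness. The sketch below shows that general idea only. The function name, the 60-second base, and the 4-hour cap are assumptions chosen for the example, not values taken from the BOINC client.

```python
import random


def next_retry_delay(failures, base=60.0, cap=4 * 3600.0, rng=random.random):
    """Seconds to wait before the next request after `failures` failed attempts.

    Hypothetical helper: base and cap are illustrative values only.
    """
    # Double the ceiling with every consecutive failure, but never exceed the cap.
    ceiling = min(cap, base * (2 ** failures))
    # Pick a point in the upper half of the ceiling so that many clients
    # recovering from the same outage do not all retry at the same instant.
    return ceiling * (0.5 + 0.5 * rng())


if __name__ == "__main__":
    for n in range(10):
        print(n, round(next_retry_delay(n)), "seconds")
```

With a scheme like this, a client that failed many times during the outage can easily be waiting hours for its next attempt even though the servers are already back.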
Rosetta Moderator: Mod.Sense
ID: 68039 · Rating: 0

deesy58
Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68040 - Posted: 11 Oct 2010, 20:40:13 UTC - in response to Message 68039.  

deesy58, in the future, please do not respond to posts you find not particularly helpful. Often when server outages occur, various users have different observations over time as things recover. One user stating their observations after you have stated yours should not be taken as any contradiction nor expectation on what you observe. If you string together 5 or 6 such factual observations, you can often see progress with time on getting things back to normal. And so it is commonplace to make such posts in threads such as this one.

When facts about errors, and retries are omitted from problem descriptions the reader is left to presume many things; especially in a project where you can specify a preference for tasks to run anywhere from an hour to 24hrs.

If the reader frequents a number of BOINC project message boards, they often fill in missing details with similar problems they are familiar with. There are a number of issues where BOINC's core client is not requesting work from projects even when cores are idle.

BOINC versions do not automatically update themselves, so every client machine can be different. Indeed there are scores of versions possible.

When you have a number of rapid failures in a row, the BOINC core client can get confused about how long to expect tasks to take to complete and has trouble requesting a proper amount of work to match the desired network preferences. It appears a batch of v2.16 tasks that were sent out failed on startup. These have since been removed. When such pervasive problems are encountered, the servers get bottlenecked trying to replace and reissue the failing tasks.

No, project servers do not contact the attached machines when they recover from an outage... in order or otherwise. Project servers never contact the attached machines; the architecture is always client-pull, not server-push. Depending upon how many times your machine tried to contact the project, the delay time until the next request gets increasingly large. That may explain any time gap where your machine did not attempt to contact the project for a few hours (if that occurred).


Thanks for the advice, Mod.Sense. I suppose I shouldn't be responding to a Troll. It's just that the implications in his post are that NOBODY ELSE experiences any idle time, and that it is because of something I am doing wrong, but which can't be explained by anybody. Actually, I have received e-mail from other participants that contradict his assertions, so I shouldn't really pay any attention to his sarcastic rants.

Your explanation could, however, be a little more clear:

Why, for example, might it be the case that some users appear to have their work buffers replenished very quickly after an outage, while others are forced to wait a number of hours for additional work? Why is this the case even if BOINC is exited and restarted, and why would it be the case even if "Update" is selected? Is there anything at all that can prevent this idle time on a user's computer (other than selecting a massive 10-day buffer)?

I understand that the system uses a "pull" technique to distribute work. What I don't understand is why my machine repeatedly reports that the servers might be down even when the "Server Status Page" reports all servers up and running.

Not that this is a big thing, but the Troll seems to be implying that I am doing something wrong if I accept BOINC defaults. If the settings are wrong, why would they be defaults? Do you have any thoughts on this specific matter?

BTW, my tasks are set to run for 4 hours with a two-day buffer. I have a broadband connection that is always available, and I have selected no restrictions in my preferences (that I am aware of). I allow 100% of the processors on my machine to be used 100% of the time. I switch between applications every 50,000 minutes. (Is this wrong?)

I allow a maximum of 100 Gigabytes of disk space or 50% of total available disk space. I restrict swap space to 75% of the page file. Memory usage is limited to 70% when the computer is in use, and 90% when the computer is idle. Applications remain in memory when suspended.

Which of these settings would you recommend be changed to ensure that my machine will not be idled for prolonged periods after a Rosetta server outage?

Thanks!

deesy


ID: 68040 · Rating: 0

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68041 - Posted: 11 Oct 2010, 22:42:29 UTC

What I saw this weekend - which may be different than what you saw - is that neither my BOINC client nor my browser could hit bakerlab.org.

Ping would not work and when I attempted to do a nslookup on bakerlab.org the address would not resolve. So from my perspective it appeared to be network related where I could not get to the correct name server.

After a period of time when things started coming back up bakerlab.org would resolve and I could hit it with my browser. However, the BOINC client still would not connect and upload the completed tasks, even when I did a "do network activity."

After waiting for several hours and still not getting the desired uploads completed I noticed on one system I had taken down to patch that as soon as I restarted BOINC the uploads went right through.

I recycled the clients on the other machines and everything went through.

It looked like during the outage either the IP address for bakerlab.org changed, or somehow during the outage my BOINC clients picked up a bad address for bakerlab.org and the BOINC client had the old / bad address cached.
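
Editor's illustration: the stale-address theory above can be checked by comparing a fresh DNS lookup with the address a long-running program might still be holding. This is a minimal sketch, not a description of how the BOINC client caches addresses. The helper names and the example IP addresses (taken from the 192.0.2.0/24 documentation range) are assumptions.

```python
import socket


def resolve(host, resolver=socket.gethostbyname):
    """Return the current IPv4 address for host, or None if it does not resolve."""
    try:
        return resolver(host)
    except OSError:
        # socket.gaierror is a subclass of OSError and is raised when
        # the name server cannot be reached or the name is unknown.
        return None


def address_changed(cached_ip, host, resolver=socket.gethostbyname):
    """True when a fresh lookup succeeds and differs from the cached address."""
    current = resolve(host, resolver)
    return current is not None and current != cached_ip


if __name__ == "__main__":
    # "192.0.2.1" is a placeholder for whatever address was seen earlier.
    print(resolve("bakerlab.org"))
    print(address_changed("192.0.2.1", "bakerlab.org"))
```

If a fresh lookup returns a different address than the one seen before the outage, restarting the client (as described above) is a plausible way to force it to look the name up again.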

But that's just what it looked like on my systems. Your results may have been different.
ID: 68041 · Rating: 0

deesy58
Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68042 - Posted: 11 Oct 2010, 23:40:42 UTC

I, also, was unable to successfully ping BOINC during the outage. Since it appears that there really was a server outage, this would not be surprising.

deesy
ID: 68042 · Rating: 0

Michael Gould
Joined: 3 Feb 10
Posts: 39
Credit: 15,412,342
RAC: 2,788
Message 68047 - Posted: 12 Oct 2010, 5:42:12 UTC - in response to Message 68040.  


Which of these settings would you recommend be changed to ensure that my machine will not be idled for prolonged periods after a Rosetta server outage?

Thanks!

deesy




Of course, you could attach to another distributed project as a backup (set to a very low resource share), and then your machine will never be idle! Well, hardly ever...
ID: 68047 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68050 - Posted: 12 Oct 2010, 16:22:14 UTC - in response to Message 68040.  

Your explanation could, however, be a little more clear:

Why, for example, might it be the case that some users appear to have their work buffers replenished very quickly after an outage, while others are forced to wait a number of hours for additional work? Why is this the case even if BOINC is exited and restarted, and why would it be the case even if "Update" is selected? Is there anything at all that can prevent this idle time on a user's computer (other than selecting a massive 10-day buffer)?


If you picture a bank, with 1000 customers demanding to withdraw their savings immediately, only with no lines... just a free-for-all on how and when a customer reaches a teller, that is what the world is like for an internet server (of any kind, not just BOINC). So while the bank is open (i.e. servers are running), your request may timeout before it gets a reply. So, in addition to cases where BOINC clients sit idle, not requesting more work when you would think they should, the response times from an overloaded server are always highly variable. It just boils down to random chance. When a request is received, it is fulfilled the same way as when the servers are not busy (i.e. the request, if it is completed, gets assigned all of the work necessary to fill it, assuming enough work is available). So the small percentage of requests that complete in the chaotic bank lobby, are handled with the same care and attention to detail as when the bank is not busy.

Updating to the project only puts your hat in the ring more times (assuming the updates are not so frequent that you get the "last request too recent" messages). If the server is still busy, you still only have a 1% chance of getting lucky. The good news is that the servers are generally able to work through such backlogs in just a few hours, and so many people don't even notice there was an outage.
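
Editor's illustration: the "hat in the ring" point can be put into numbers. If each request has the 1% chance mentioned above and attempts are treated as independent (an assumption made here for simplicity, since a real overloaded server will not behave that neatly), the chance that at least one of n requests gets through is 1 - (1 - p)^n.

```python
def chance_of_service(attempts, p_success=0.01):
    """Probability that at least one of `attempts` independent requests succeeds."""
    return 1.0 - (1.0 - p_success) ** attempts


if __name__ == "__main__":
    for n in (1, 5, 10, 25, 69):
        print(n, "attempts:", round(chance_of_service(n) * 100, 1), "%")
```

Under those assumptions it takes roughly 69 attempts to reach even odds, which is why a handful of manual updates rarely makes a visible difference while the backlog lasts.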

Exiting and restarting BOINC would not be expected to change anything. When it restarts, it retains the backoff timers that it ended with. So if you have completed tasks to be uploaded, and they have retried several times and were waiting until 9PM to try again, they will still wait until 9PM, regardless of whether you restart BOINC.
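
Editor's illustration: a backoff timer survives a restart when the next retry time is written to disk and read back on startup. The sketch below shows that pattern in general terms. The JSON layout and the helper names are hypothetical and do not describe the files the BOINC client actually writes.

```python
import json
import os


def save_state(path, next_retry_at):
    """Persist the time (seconds since the epoch) of the next allowed attempt."""
    with open(path, "w") as f:
        json.dump({"next_retry_at": next_retry_at}, f)


def seconds_until_retry(path, now):
    """Return how long the client must still wait, given the saved state."""
    if not os.path.exists(path):
        # No saved timer: nothing to wait for.
        return 0.0
    with open(path) as f:
        state = json.load(f)
    # A restart does not reset the timer; only the passage of time does.
    return max(0.0, state["next_retry_at"] - now)
```

Because the wait is computed from the saved timestamp rather than from the moment the program started, exiting and restarting leaves the remaining wait unchanged.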

So far as preventing idle time, I would suggest installing a current BOINC version if you have not already done so. There have been a number of changes recently to the work scheduling methods used on the client that should help. Also, the previous suggestion of attaching to more than one project is a good one. BOINC now allows you to establish a resource share of zero to indicate a project that you don't wish to maintain debts to, but wish to contact if no work is available from your projects with non-zero resource shares.
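
Editor's illustration: the zero-resource-share rule described above amounts to a simple fallback. The sketch below is a deliberate simplification: the real client also balances time between projects in proportion to their shares, which is not modeled here. The project name "backup-project" is a made-up placeholder.

```python
def choose_project(projects):
    """Pick a project to ask for work.

    `projects` is a list of (name, resource_share, has_work) tuples.
    Projects with a non-zero share are preferred; zero-share projects
    are only used when none of the others has work.
    """
    for name, share, has_work in projects:
        if share > 0 and has_work:
            return name
    for name, share, has_work in projects:
        if share == 0 and has_work:
            return name
    return None


if __name__ == "__main__":
    during_outage = [("Rosetta@home", 100, False), ("backup-project", 0, True)]
    print(choose_project(during_outage))
```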

I understand that the system uses a "pull" technique to distribute work. What I don't understand is why my machine repeatedly reports that the servers might be down even when the "Server Status Page" reports all servers up and running.


I hope I've addressed that above. The bank is open, but a customer that's been waiting for an hour and still not received service might perceive otherwise.

Not that this is a big thing, but xxxxxx seems to be implying that I am doing something wrong if I accept BOINC defaults. If the settings are wrong, why would they be defaults? Do you have any thoughts on this specific matter?


Any system which allows you to configure things does so because there are cases when such a configuration may be exactly what the user of the system desires. When someone reports X, Y and Z are becoming a problem for them, the response will likely be configuration changes that will help optimize for those specific issues. For example, you have a full-time internet connection, but some users are sharing a dial-up connection with 5 other people. They might use settings to help assure they minimize network bandwidth, or use it only at night. Every situation is different, and every BOINC project is different. The preferences are an attempt to give the user enough control to be happily crunching the projects they wish to, without interfering with other uses of their machine.

BTW, my tasks are set to run for 4 hours with a two-day buffer. I have a broadband connection that is always available, and I have selected no restrictions in my preferences (that I am aware of). I allow 100% of the processors on my machine to be used 100% of the time. I switch between applications every 50,000 minutes. (Is this wrong?)


No combination of settings is "wrong". The question is whether it is appropriate for whatever your goals are. 50,000 minutes between application switches is a long time, but if your tasks are completing in 4 hours as planned, it isn't going to matter. BOINC will do a very good job at switching only after a checkpoint has been reached. The 60min default is more common. But, as I say, it's not hurting anything. If you only have one BOINC project, the setting is essentially ignored because there is no other application to switch to.

I allow a maximum of 100 Gigabytes of disk space or 50% of total available disk space. I restrict swap space to 75% of the page file. Memory usage is limited to 70% when the computer is in use, and 90% when the computer is idle. Applications remain in memory when suspended.


Sounds good.

Which of these settings would you recommend be changed to ensure that my machine will not be idled for prolonged periods after a Rosetta server outage?


A larger buffer of work, and a backup project are the primary ways to avoid idle time. The other is connecting to projects that consistently have work available, and have servers that are available most of the time. You've already done that by selecting Rosetta@home. Beyond that, it helps to accept that when your machine has no work to process it is using much less electricity, and that out of 365 days in a year, a half day of idle time once in a while is a very small fraction of time.
Rosetta Moderator: Mod.Sense
ID: 68050 · Rating: 0

deesy58
Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68052 - Posted: 12 Oct 2010, 18:48:05 UTC

If you picture a bank, with 1000 customers demanding to withdraw their savings immediately, only with no lines... just a free-for-all on how and when a customer reaches a teller, that is what the world is like for an internet server (of any kind, not just BOINC). So while the bank is open (i.e. servers are running), your request may timeout before it gets a reply. So, in addition to cases where BOINC clients sit idle, not requesting more work when you would think they should, the response times from an overloaded server are always highly variable. It just boils down to random chance. When a request is received, it is fulfilled the same way as when the servers are not busy (i.e. the request, if it is completed, gets assigned all of the work necessary to fill it, assuming enough work is available). So the small percentage of requests that complete in the chaotic bank lobby, are handled with the same care and attention to detail as when the bank is not busy.

Updating to the project only puts your hat in the ring more times (assuming the updates are not so frequent that you get the "last request too recent" messages). If the server is still busy, you still only have a 1% chance of getting lucky. The good news is that the servers are generally able to work through such backlogs in just a few hours, and so many people don't even notice there was an outage.


I understand your analogy to a bank. Wouldn't this be similar to the way that Web servers and Database servers function?

Would it be an acceptable design for a Database server to fill a request for multiple records (many more than the client could possibly process at the time) while other requests go unfilled for as long as several hours? I can imagine the havoc that would be generated in a large client-server or Web-based ERP system (for example) if such a strategy for recovery after an outage were employed. Would it be an improvement to limit the number of work units that were distributed to each user for some period of time after a server outage - say 24 hours? Why not ration the work until such time as the system has completely recovered from the outage? It's difficult to see how such an approach would not be more efficient.

Food for thought.

deesy
ID: 68052 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68053 - Posted: 12 Oct 2010, 20:06:00 UTC

If I understand you right, you are suggesting that the backlog will be cleared sooner if the bank limits withdrawals to $250 per customer per some time limit. But to do so, now each teller has to verify the time limit, one which varies for each specific customer, before completing a transaction, and the customer has to make multiple transactions just to get the $1,000 they came for. Won't this make each transaction take slightly longer? And wouldn't that make it take longer to get on top of the backlog?

The approach, and this is from the Berkeley server code, is instead to try to make best use of each contact with the client. "best" here meaning send them everything they need. You don't know when they will be able to connect again. This might be the only work request the machine is allowed to make all week. Any client requesting "more work than it could possibly process" is already refused. In general, if everything is running well on the client side, no such requests are ever made.

I believe what you are suggesting though is to only send one task per CPU for example. Figuring that this will keep the client machine happy for a little while, and by the time they finish those, perhaps the server will be back to normal again. The idea has some merit, indeed I've had thoughts along those lines myself, but the code changes to determine when to enter this server conservation mode add additional overhead and are quite complex, and the potential benefits are fairly minimal. There are a certain number of database hits that occur for each scheduler request, regardless of the amount of work being sent. So if you are doing 20 IOs already, to send out 1 task, why not do 5 more and send out 6 tasks and fulfill the entire request? Your alternative is to process multiple server hits, doing the 20 IOs multiple times to send out 6 tasks.
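
Editor's illustration: the IO argument above can be worked through with the figures given in the post (20 IOs to send the first task of a request, one more for each additional task). These are the post's illustrative numbers, not measured values, and the function name is made up for the example.

```python
def total_ios(tasks_wanted, tasks_per_request, io_first_task=20, io_each_extra=1):
    """Database IOs needed to deliver `tasks_wanted` tasks with a per-request cap."""
    if tasks_per_request < 1:
        raise ValueError("tasks_per_request must be at least 1")
    total = 0
    remaining = tasks_wanted
    while remaining > 0:
        batch = min(remaining, tasks_per_request)
        # Every request pays the fixed cost once, plus a little per extra task.
        total += io_first_task + io_each_extra * (batch - 1)
        remaining -= batch
    return total


if __name__ == "__main__":
    print("6 tasks in one request: ", total_ios(6, 6))   # 25 IOs
    print("6 tasks, capped at 4:   ", total_ios(6, 4))   # 44 IOs
    print("6 tasks, one at a time: ", total_ios(6, 1))   # 120 IOs
```

Under these figures, filling the whole request at once costs 25 IOs, while rationing to one task per request costs 120 IOs for the same six tasks.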

You could argue that there are multiple server hits occurring now as well, and that is certainly true. But when a web server is backlogged and requests are timing out, the server basically doesn't even actually see the ones it couldn't get to in time, so they bear no cost to the server's performance. So the approach taken by Berkeley yields the least demand for resources on the server overall. The number of requests actually processed is minimized by this approach.

No project will have work available all of the time. With a two day buffer, you have already mitigated your risk of not having work during an outage. Unfortunately for you, it would seem your two days was up (i.e. your need for more work occurred) during an outage. If the average outage is 6 hours and the average server recovery time is 2 hours, your 2 day buffer already reduced your odds of encountering a server backlog to 1 in 6. Given those (roughly historical, yet guesstimated) numbers, 5 out of 6 times you would be completely unaware of a 6hr outage when carrying a 2 day buffer of work.
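
Editor's illustration: one way to read the "1 in 6" figure above is as the share of a two-day buffer window taken up by the outage plus the recovery period, assuming the moment the client needs more work falls at a uniformly random point in that window. That reading is an assumption made for this example. The hour values are the guesstimates from the post.

```python
def exposure_fraction(outage_h=6.0, recovery_h=2.0, buffer_h=48.0):
    """Chance that the need for new work lands inside the outage or recovery period."""
    return min(1.0, (outage_h + recovery_h) / buffer_h)


if __name__ == "__main__":
    # (6 + 2) / 48 = 1/6, so about 5 times out of 6 the outage goes unnoticed.
    print(exposure_fraction())
```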
Rosetta Moderator: Mod.Sense
ID: 68053 · Rating: 0

deesy58
Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68055 - Posted: 12 Oct 2010, 22:31:33 UTC - in response to Message 68053.  

If I understand you right, you are suggesting that the backlog will be cleared sooner if the bank limits withdrawals to $250 per customer per some time limit. But to do so, now each teller has to verify the time limit, one which varies for each specific customer, before completing a transaction, and the customer has to make multiple transactions just to get the $1,000 they came for. Won't this make each transaction take slightly longer? And wouldn't that make it take longer to get on top of the backlog?


I don't think your example is exactly on point. It takes the same amount of time to withdraw $1 as it does to withdraw $1,000. Perhaps a better example might be the situation where you are standing in line at the bank because you only need to cash a check. Five places in front of you in the line is a person who is purchasing seven cashier's checks, certifying two additional checks, depositing the children's piggy banks, making a mortgage payment, making a car payment, and paying all of his/her utility bills.

Supermarkets have solved this type of problem (at least in the U.S.) by implementing "Express Lanes" where only a limited number of items can be checked out. Before the establishment of such conveniences in virtually all supermarkets, it was possible to stand in line for 15 or 20 minutes (or more) just to pay for a single carton of milk.

The approach, and this is from the Berkeley server code, is instead to try to make best use of each contact with the client. "Best" here means sending them everything they need. You don't know when they will be able to connect again. This might be the only work request the machine is allowed to make all week. Any client requesting "more work than it could possibly process" is already refused. In general, if everything is running well on the client side, no such requests are ever made.


Hmm. If I understand correctly, a user can request as much as ten days worth of additional work to be loaded into a buffer. If that user has a quad-core processor, and is using the default 3-hour run time, how many work units would be downloaded to that user during a single connection, and how much time would that process take? Assume that the user is, as you point out as an example, using a dial-up connection, and the server must wait for the completion of the transaction before responding to a request from another user.
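
As a rough upper bound on the task count in that scenario, here is a back-of-the-envelope sketch. It assumes the machine crunches 24 hours a day on every core and that each task runs for exactly the target run time; the function name is made up for illustration.

```python
def tasks_to_fill_buffer(buffer_days, cores, hours_per_task):
    """Number of tasks needed to keep every core busy for `buffer_days`.

    Assumes 24/7 operation and that each task takes exactly
    `hours_per_task` hours on one core.
    """
    cpu_hours = buffer_days * 24 * cores
    return cpu_hours // hours_per_task


print(tasks_to_fill_buffer(buffer_days=10, cores=4, hours_per_task=3))  # 320
print(tasks_to_fill_buffer(buffer_days=2, cores=2, hours_per_task=3))   # 32
```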

I believe what you are suggesting though is to only send one task per CPU for example.


Actually, no. Since the number of processors in use by the average user is one, two or four, I would suggest that the number of tasks to be downloaded during the first connection after a server outage be limited to four.

The idea has some merit, indeed I've had thoughts along those lines myself, but the code changes to determine when to enter this server conservation mode add additional overhead and are quite complex, and the potential benefits are fairly minimal.


I agree that the task would be complex. This might, however, be one of those situations where Occam's Razor might not apply. Without some sort of cost/benefit analysis, we'll never know. The question is, what would be the overall effect on the productivity of the project as a whole?

There are a certain number of database hits that occur for each scheduler request, regardless of the amount of work being sent. So if you are doing 20 IOs already, to send out 1 task, why not do 5 more and send out 6 tasks and fulfill the entire request? Your alternative is to process multiple server hits, doing the 20 IOs multiple times to send out 6 tasks.


Could you expand on this a little?

You could argue that there are multiple server hits occurring now as well, and that is certainly true. But when a web server is backlogged and requests are timing out, the server basically doesn't even actually see the ones it couldn't get to in time, so they bear no cost to the server's performance. So the approach taken by Berkeley yields the least demand for resources on the server overall. The number of requests actually processed is minimized by this approach.


Well, yes, if one focuses only on the loading of the server, and not on the production of the entire grid. Let me try a different analogy:

Suppose we have a network of pumps that remove water from an area that is prone to flooding (New Orleans, The Netherlands, etc.) and the supply of diesel fuel used to power the pumps has been temporarily interrupted. When a shipment of fuel arrives, would it be better to start with the first pump and completely fill its fuel tank, giving it enough fuel to run for several days (or more), but at the expense of having insufficient fuel to run other pumps? Or might it be better to ration the fuel amongst all of the pumps, ensuring that all of them are brought back on line as quickly as possible? Perhaps all pumps could be kept running until additional shipments of fuel arrive at the levees/dikes.

No project will have work available all of the time. With a two day buffer, you have already mitigated your risk of not having work during an outage. Unfortunately for you, it would seem your two days was up (i.e. your need for more work occurred) during an outage. If the average outage is 6 hours and the average server recovery time is 2 hours, your 2 day buffer already reduced your odds of encountering a server backlog to 1 in 6. Given those (roughly historical, yet guesstimated) numbers, 5 out of 6 times you would be completely unaware of a 6hr outage when carrying a 2 day buffer of work.


This sounds completely reasonable and logical, but it does not reflect my experiences during two server outages since I have been participating in this project. If I have a two-day buffer (plus the tasks that are currently being processed), and if the servers are only out of service for about nine hours, how is it that my machine works its way through the entire buffer and runs out of work for (in the most recent case) more than nine hours? The numbers do not seem to add up.

I don't think I am the only one who has thought about this question, so any additional light you might be able to shed would probably be appreciated by others, also.

Thanks for sharing your knowledge.

deesy


ID: 68055
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68071 - Posted: 13 Oct 2010, 14:11:51 UTC

It takes the same amount of time to withdraw $1 as it does to withdraw $1,000.


You've got me there. An indication that we are hitting the end of the usefulness of the analogy. I was focused on the resulting demands for service rather than specific service times.

If that user has a quad-core processor, and is using the default 3-hour run time, how many work units would be downloaded to that user during a single connection, and how much time would that process take? Assume that the user is, as you point out as an example, using a dial-up connection, and the server must wait for the completion of the transaction before responding to a request from another user.


Your assumptions are not pertinent. The server can send back a response with 25 tasks or whatever; the actual downloads of the required files are independent of that and are performed by other servers. Also, as a process waits for disk IO on the server, other work is performed, so it is not as simple as the queue at the bank. It's like a teller that services two other customers while the coins are jingling in the counting machine for the deposit from the arcade, then calls the arcade depositor back up to the window to give them their receipt.

Could you expand on this a little?


At risk of furthering the bank analogy, it would be like saying that once I've called up your account and verified your identity, it takes very little time to deposit three checks rather than one. The server can probably respond with 20 tasks in less than double the CPU time and database IO that it takes to respond with a single task. And the number of hosts requesting that much work in one shot would probably be down below the 2% area. So rerouting them or handling them differently won't have much impact.

This sounds completely reasonable and logical, but it does not reflect my experiences during two server outages since I have been participating in this project. If I have a two-day buffer (plus the tasks that are currently being processed), and if the servers are only out of service for about nine hours, how is it that my machine works its way through the entire buffer and runs out of work for (in the most recent case) more than nine hours? The numbers do not seem to add up.


...ah! NOW you are coming around to where some of the original conversation was focused. If you can crunch 2 days worth of work in 9 hours, it clearly indicates that you did not truly have 2 days of work; right? Simple as that. And that means the BOINC client did not request enough to truly maintain a 2 day cache. And that is exactly the sort of thing other people were observing throughout the BOINC community, and changes were made to BOINC to refine the work-fetch rules to avoid idle CPUs.
Rosetta Moderator: Mod.Sense
ID: 68071
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 68075 - Posted: 13 Oct 2010, 19:32:37 UTC - in response to Message 68071.  


...ah! NOW you are coming around to where some of the original conversation was focused. If you can crunch 2 days worth of work in 9 hours, it clearly indicates that you did not truly have 2 days of work; right? Simple as that. And that means the BOINC client did not request enough to truly maintain a 2 day cache. And that is exactly the sort of thing other people were observing throughout the BOINC community, and changes were made to BOINC to refine the work-fetch rules to avoid idle CPUs.


Not really! I never said that I could crunch two days of work in nine hours. The servers might have been down for 9 hours, but my machine received no new work for more than 24 hours. The result was that all of the work in my buffer was completed, and I had more than twenty tasks waiting to report. During the first 48 hours after the onset of the outage, my machine continued to work through the contents of its buffer. When that was all completed, it was forced to wait for additional work from Rosetta's servers, which was not forthcoming for an additional nine hours. This is the issue that nobody has been able to explain to me, or to anybody else who has experienced the same problem. For additional clarification, please go back and re-read my first two posts on this thread. Note that the interval between them was almost 22 hours. Note also that the servers had already been down for some number of hours before I noticed it and made my first post.

If I/O on the servers really occurs the way you explain, then it would take no longer at all to supply my dual-core, broadband-connected machine with four new tasks than it would to supply User X with 300 tasks over a dial-up connection. This is illogical on its face, despite the capabilities of computer multi-tasking. It assumes that a single disk access could supply all 300 records to be sent, and that the communications I/O time is essentially zero. It also assumes an infinite number of simultaneous client connections.

The Rosetta hardware is, indeed, powerful, and the use of a fiber SAN to connect the servers to the storage system is a great system design feature. The disks are quite fast, but they are still rotating storage, and the volume of I/O is still quite significant. Even though the disks are organized as a single LUN, aren't access times going to vary? Could anybody be confident that all of the data required for a large download would be co-located on the same track[s], or on adjacent tracks of the disk subsystem? Does the caching system completely eliminate any access latency? Is the cache "hit rate" 100%?

All this might be the case, but I remain skeptical. No offense.

I'm still not at all clear why, with a two-day buffer, my machine becomes idle, and remains idle, after brief temporary server outages. If the resource "cost" for a large buffer is insignificant, why doesn't BOINC default to a much larger one - say 10 days or so?

deesy
ID: 68075
Murasaki

Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 68076 - Posted: 13 Oct 2010, 20:13:19 UTC - in response to Message 68075.  
Last modified: 13 Oct 2010, 20:53:48 UTC

I'm still not at all clear why, with a two-day buffer, my machine becomes idle, and remains idle, after brief temporary server outages.


Because Rosetta does not guarantee to provide you with work at all times. BOINC includes the option to subscribe to other useful projects to keep your computer busy when your main project is experiencing difficulties. That you have chosen to work exclusively with Rosetta is admirable, but you do have to accept the pitfall that you have no backup and that you will likely be one of the first to experience any difficulties and see them last longer than other users.

People who have been able to select more than one project encounter problems more rarely as their systems adjust automatically to compensate.

If the resource "cost" for a large buffer is insignificant, why doesn't BOINC default to a much larger one - say 10 days or so?


Because a one-size-fits-all approach doesn't work. One example is that a default setting of a 10 day buffer would be problematic for people with intermittent connections. If you have a set of tasks with a 12 day turnaround but only connect to the network once every 7 days, you will return some of your tasks on the 7th day but the rest will expire on day 12; when you reconnect to the network to upload your tasks on day 14 you will find that all work between day 7 and day 10 has expired and is effectively wasted.
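
The example above can be sketched as a small calculation. It assumes all tasks are downloaded on day 0, one day of buffered work is finished per day, and results can only be reported on connection days (7, 14, ...). The function name and parameters are hypothetical.

```python
import math


def wasted_work_days(buffer_days, deadline_day, connect_every):
    """Return the days of buffered work whose results are reported too late.

    Assumes tasks are fetched on day 0, one day of work is completed per
    day, and results are only uploaded on multiples of `connect_every`.
    """
    late = []
    for finished_on in range(1, buffer_days + 1):
        # Next scheduled connection on or after the day the work finishes.
        reported_on = math.ceil(finished_on / connect_every) * connect_every
        if reported_on > deadline_day:
            late.append(finished_on)
    return late


print(wasted_work_days(buffer_days=10, deadline_day=12, connect_every=7))
# [8, 9, 10]: work finished after the day-7 upload misses the day-12 deadline
print(wasted_work_days(buffer_days=2, deadline_day=12, connect_every=7))
# []: a small buffer is fully reported on day 7
```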

For some people a 10 day work buffer may be an ideal solution, which is why you can customise your settings to match your own circumstances.
ID: 68076
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68077 - Posted: 13 Oct 2010, 20:34:53 UTC

If the resource "cost" for a large buffer is insignificant, why doesn't BOINC default to a much larger one - say 10 days or so?


Well, there are other costs, such as the database and disk space consumed by the existence of a task. So BOINC tries to ONLY request the number of tasks the client will reasonably be expected to complete in the number of days of the desired buffer size. Hence, client machines don't request 300 tasks at a time. There are essentially zero such requests to handle any differently than the existing system.

As for others experiencing the same problem, people run out of work during outages all the time. No mystery there. The specifics of why it happened to you are unclear. You've not provided a complete picture of what was observed. Specifically, what messages were observed when the core client hit the scheduler, and whether the indication was that new work was being requested, or not.

Considering that tasks are assigned from a shared memory segment, one could assert the equivalent of a 100% cache hit rate, although there's a background task doing I/O to refill the shared memory buffer. My point was not that there was zero resource required to fulfill a larger request, simply that the difference between a request for 1 task, and a request for dozens is minimal.
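
A toy model of that arrangement, for illustration only: this is not the actual BOINC code, and the class name, slot count, and `db_reads` counter are all made up. It only shows the idea that requests are answered from memory while a separate refill step does the database reads.

```python
from collections import deque


class TaskCache:
    """Toy in-memory task cache with a separate refill step (hypothetical)."""

    def __init__(self, pending, slots):
        self.pending = deque(pending)  # stands in for tasks in the database
        self.slots = deque()           # stands in for the shared memory segment
        self.capacity = slots
        self.db_reads = 0

    def refill(self):
        """Background step: top up the in-memory slots from the database."""
        while self.pending and len(self.slots) < self.capacity:
            self.slots.append(self.pending.popleft())
            self.db_reads += 1

    def assign(self, wanted):
        """Answer a work request from memory only, with no database reads."""
        granted = []
        while self.slots and len(granted) < wanted:
            granted.append(self.slots.popleft())
        return granted


cache = TaskCache(pending=range(100), slots=10)
cache.refill()
small = cache.assign(1)   # a request for one task
large = cache.assign(6)   # a request for six tasks
print(len(small), len(large), cache.db_reads)  # 1 6 10
```

In this sketch neither request triggers a database read of its own; the read cost is paid by the refill step regardless of how the tasks are later split between requests.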

I am really unclear why you presume other users were requesting, and being granted piles of surplus tasks while you were without work.
Rosetta Moderator: Mod.Sense
ID: 68077
©2024 University of Washington
https://www.bakerlab.org