Message boards : Number crunching : Servers?
Author | Message |
---|---|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
In rereading your original posts, I believe the extended backlog on the servers was due to the tasks that had a new optional parameter specified, which was no longer supported by the v2.16 version (the failures you mentioned). So a lot of the work that was being sent out did not keep the machines busy. The tasks immediately failed. The machine immediately required more work and the cycle continued. Rosetta Moderator: Mod.Sense |
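As a rough illustration of the cycle described above, here is a minimal Python sketch. It assumes a hypothetical host that asks for a fresh batch as soon as its current batch is used up, and it ignores any backoff the real BOINC client or server may apply. The batch size, core count, and runtimes are made-up figures, not values taken from the project.

```python
def scheduler_requests_per_day(batch_size: int, cores: int, seconds_per_task: float) -> float:
    """Requests per day from one host that refills only when its batch is used up.

    Simplified model: all cores stay busy, every task takes the same time,
    and no client or server backoff is applied (an assumption).
    """
    seconds_to_drain_batch = batch_size * seconds_per_task / cores
    return 86_400 / seconds_to_drain_batch


if __name__ == "__main__":
    # Hypothetical figures: 8 tasks per request, 2 cores.
    healthy = scheduler_requests_per_day(batch_size=8, cores=2, seconds_per_task=3 * 3600)
    failing = scheduler_requests_per_day(batch_size=8, cores=2, seconds_per_task=10)
    print(f"healthy 3-hour tasks: {healthy:.0f} requests/day")
    print(f"tasks failing after 10 s: {failing:.0f} requests/day")
```

Under these assumptions, a host whose tasks die within seconds comes back to the scheduler far more often than a host doing useful work, which is the kind of extra load the post attributes to the failing tasks.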
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Seems clear to me. deesy |
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Because Rosetta does not guarantee to provide you with work at all times. BOINC includes the option to subscribe to other useful projects to keep your computer busy when your main project is experiencing difficulties. That you have chosen to work exclusively with Rosetta is admirable, but you do have to accept the pitfall that you have no backup and that you will likely be one of the first to experience any difficulties and see them last longer than other users. Whether Rosetta makes any specific guarantees is not the issue. The issue is why some users might receive additional work immediately after a server outage while other users must wait for an additional (sometimes prolonged) period before receiving new tasks. People who have been able to select more than one project encounter problems more rarely as their systems adjust automatically to compensate. I began participating in the Seti@home project shortly after its inception. I switched to Predictor@home for the period during which it was running in California. Then I switched back to Seti for a while, before joining the Folding@home project under a couple of different user and team names. I accumulated more than a million points under my most recent user ID before switching to Rosetta@home. I am processing for Rosetta for a very specific reason. I believe that their project is more nearly an "applied science" than it is a "theoretical science." I understand that the Project is searching for the causes, cures and treatments for specific diseases. If my understanding of Rosetta's ultimate goals were to change, I would probably look for a different grid computing project to join. As of now, there are no others of which I am aware that would persuade me to apply any of my computing resources.
I take your point. Thanks for the clarification. deesy |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Servicing one host's request for 8 tasks is hardly excessive. And, again, we're not even certain your machine was correctly requesting new tasks. We are left to presume that BOINC SHOULD have been smart enough to be requesting tasks when it is completely out of work to do (it is NOT that smart in all versions for various reasons). Nor do we have a clear picture of whether you were sitting there hitting update so frequently that all of your requests were refused, because your prior request was too recent (which would be indicated in the messages at the time). Rosetta Moderator: Mod.Sense |
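The "prior request was too recent" refusal can be pictured with a small Python sketch. This is a hypothetical model, not the project's actual scheduler code: the class name `RequestGate`, the 60-second interval, and the rule that a refused request also restarts the clock are all assumptions chosen to match the behaviour described in the post.

```python
from typing import Optional


class RequestGate:
    """Refuses a scheduler request that arrives too soon after the previous one.

    Hypothetical model: `min_interval` seconds must pass since the host's last
    request. A refused request still counts as the "last request" here, which
    is an assumption made so that rapid clicking keeps getting refused.
    """

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self.last_request: Optional[float] = None

    def request(self, now: float) -> bool:
        """Return True if the request made at time `now` (seconds) is accepted."""
        too_recent = (
            self.last_request is not None
            and now - self.last_request < self.min_interval
        )
        self.last_request = now  # refused requests restart the clock too (assumption)
        return not too_recent


if __name__ == "__main__":
    gate = RequestGate(min_interval=60)
    for t in (0, 10, 20, 30, 100):
        print(t, "accepted" if gate.request(t) else "refused: last request too recent")
```

In this model a user clicking update every ten seconds gets one accepted request followed by refusals until they leave it alone for a full interval.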
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
In rereading your original posts, I believe the extended backlog on the servers was due to the tasks that had a new optional parameter specified, which was no longer supported by the v2.16 version (the failures you mentioned). So a lot of the work that was being sent out did not keep the machines busy. The tasks immediately failed. The machine immediately required more work and the cycle continued. Although I am not positive of this, I thought we were still working with minirosetta 2.14 when the outage occurred. Was this not really the case? When was the switch from 2.14 to 2.15 made? I do not believe that my machine's lack of work had anything to do with Version 2.16, although the initially large number of computation errors clearly was. Do I understand correctly that if a machine is sent a number of tasks that fail shortly after being downloaded and processing has begun, then that machine can almost immediately request additional work and enter the queue ahead of all the other machines that are still waiting for work? Or is it that there really is no queue at all -- that it is solely a matter of luck whether a server is available and idle during the very few milliseconds during which a request for work is received? deesy |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"no queue at all -- solely a matter of luck" Yes, the host will be short of work very quickly in such a case and will then be vying for a scheduler request with everyone else. Its odds of getting new tasks are no better and no worse than anyone else's. Rosetta Moderator: Mod.Sense |
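The "no queue, just luck" behaviour can be sketched as a small simulation. This is an illustrative model only: it assumes a scheduler that serves a fixed number of requests per round, picked uniformly at random, and the host and slot counts are hypothetical.

```python
import random


def rounds_until_served(hosts: int, slots: int, rng: random.Random) -> int:
    """Rounds until host 0 is picked when `slots` of `hosts` requests are served per round.

    Each round the winners are drawn uniformly at random; there is no queue,
    so earlier failures do not improve a host's odds.
    """
    rounds = 0
    while True:
        rounds += 1
        if 0 in rng.sample(range(hosts), slots):
            return rounds


def chance_still_waiting(hosts: int, slots: int, rounds: int) -> float:
    """Probability that one host has not been served after `rounds` attempts."""
    return (1 - slots / hosts) ** rounds


if __name__ == "__main__":
    rng = random.Random(1)
    waits = [rounds_until_served(hosts=10, slots=2, rng=rng) for _ in range(10_000)]
    print("average wait in rounds:", sum(waits) / len(waits))
    print("longest wait in rounds:", max(waits))
    print("chance of still waiting after 10 rounds:", chance_still_waiting(10, 2, 10))
```

With these made-up numbers the average wait is about `hosts / slots` rounds, but individual hosts can wait several times longer purely by chance.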
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
"no queue at all -- solely a matter of luck" Thanks. That explains a lot. It never ceases to amaze me how so many people confuse simple luck with superior abilities. :) deesy |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I believe the implication was that perhaps you were having a problem more specific to your machine than the project in general (such as being handed a small pile of tasks that fail immediately). Rosetta Moderator: Mod.Sense |
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
I believe the implication was that perhaps you were having a problem more specific to your machine than the project in general (such as being handed a small pile of tasks that fail immediately). Not exactly. I believe that my observed problem began before the introduction of minirosetta Version 2.16 tasks. Even though a number of 2.16 tasks failed immediately on initiation after receipt, I still received at least two tasks that began running successfully within one minute, and I had several more that were entered into my buffer. My issue is, and always has been, that it takes as long as two days or more after an outage for my machine to acquire new work. As a result, all of the tasks in my buffer are completed and my machine stands idle for a number of hours waiting for work. This has occurred on more than one occasion, and I was at a loss to explain it, especially since some other users were asserting that it was because of something I was doing incorrectly. Now that you have explained clearly that it is purely a matter of chance that determines when a user might receive new work after a server outage, these misleading assertions by others are effectively debunked, and I realize that there is probably nothing at all wrong with my machine or its settings. It seems clear that those other users were, for whatever reasons, "blowing smoke" in order to spread FUD. It seems clear that a good analogy might be playing the lottery. One might win a jackpot with the very first ticket purchased, but one also might purchase tickets religiously for years without winning anything at all. Thanks. deesy |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
My buffer is large enough to hold about 20 tasks. Are you saying that 20 is insufficient? In msg 1 you said you had 14 tasks ready to upload, 5 of which had failed. Where did you get 20 from? If you'd actually had 20 it might well have been enough. Personally I keep 2 days' worth of 8-hour tasks too, which on my unattended Vista quad machine running 24/7 is around 24 tasks. I don't know how many cores you run (2?), but it doesn't seem to add up. I suppose you believe that if it is raining at your house, it must, necessarily, be raining at everybody else's house, too. Looking through this thread I see, in my absence, only one other person confirmed your problem and they found they had to restart Boinc locally to restore connectivity. There's a hint. Do you actually think that there might be multiple versions of BOINC, and that my version might be different from yours? Of course there are, and running on a wide range of platforms too. Don't we all run the same basic version of BOINC and Rosetta that are automatically updated when necessary? Rosetta yes, Boinc no, it certainly isn't updated when necessary unless you manually download a new version. If you believe that I have somehow acquired a defective version of BOINC, you should say so, and you should tell me (and everybody else) how to correct it. Just as well I didn't say so then. There are more issues than just the software and platform to consider, like with any software or connectivity issue. It appears that you are making unwarranted assumptions, again. I don't think so. You're saying the servers were up but you couldn't connect and you're assuming that's because the servers weren't actually up. But the only person reporting the same as you restarted Boinc locally and the problem went away. I know the servers were up because I connected several times when you were still suffering a problem and I happen to know I don't have a magic key that lets me in and keeps you out.
If it really was a lottery I'd expect occasional failures to connect for me and occasional successes for you. That didn't happen for me (I'm just not that lucky, unfortunately) and you never did quite detail answers to my question about what errors you had and at what times... Not that this is a big thing, but the Troll seems to be implying that I am doing something wrong if I accept BOINC defaults. If the settings are wrong, why would they be defaults? Thanks for that. First, as I pointed out before, these are Boinc defaults, not Rosetta defaults (though I doubt that makes a lot of difference in itself tbh). More importantly, defaults in this situation are lowest common denominator across all projects, not tailored to meet every eventuality of this one - especially as a 3 hour runtime and a 0.25 buffer would hardly be recommended as a panacea for all Boinc projects by anyone. That's enough, but 3rd, as has been pointed out already, no project can guarantee uptime, so putting all eggs in one project's basket is going to result in a problem eventually. If Rosetta goes down for a month, we're all running out of Rosetta tasks, though I'll still be running 24/7 here. Would it be an improvement to limit the number of work units that were distributed to each user for some period of time after a server outage - say 24 hours? Why not ration the work until such time as the system has completely recovered from the outage. It's difficult to see how such an approach would not be more efficient. Actually no. If it was a case of the server struggling to meet the demands on it (which I don't accept at all btw, but just say mod.sense is right) a user would get insufficient tasks on an eventual successful connection, so it'll just come back again and again to fill the rest of its buffer, resulting in more hits even after connection was successful, not less. 
Where this strategy would help is if there were only a few tasks to grab and it was better if everyone got something to get them started. That wasn't the case from what I can recall. (Just seen your convoluted analogy. Very good, but typically it's the one that didn't apply in this situation). I am really unclear why you presume other users were requesting, and being granted, piles of surplus tasks while you were without work. For the record, these weren't 'surplus' tasks - simply the rate my unattended quad machine completed tasks over that 36-hour period (plus refilling the buffer). No doubt I had some of the early 2.16 tasks that crashed on start-up among those. This is the issue that nobody has been able to explain to me, or to anybody else who has experienced the same problem. It's sounding more like a local connectivity issue the more I read. Did you try restarting Boinc and/or your computer at any point when you saw the servers were reporting as up, or is this thread really all down to you blaming someone else before checking at your end? That would explain everything, wouldn't it? Anyway, I'm sure everything's solved now. Until the inevitable next time... ;) |
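The argument that rationing produces more server hits, not fewer, comes down to simple arithmetic, sketched here in Python. The function name and the per-request caps are hypothetical; the 24-task buffer is the figure quoted in this post for a quad-core machine.

```python
import math
from typing import Optional


def requests_to_fill(buffer_tasks: int, cap_per_request: Optional[int] = None) -> int:
    """Successful scheduler requests needed to fill a buffer of `buffer_tasks`.

    With no cap, one successful request is assumed to fill the whole buffer.
    With a cap (the rationing proposal), the host has to come back repeatedly.
    """
    if buffer_tasks <= 0:
        return 0
    if cap_per_request is None:
        return 1
    return math.ceil(buffer_tasks / cap_per_request)


if __name__ == "__main__":
    for cap in (None, 8, 4):
        print(f"cap={cap}: {requests_to_fill(24, cap)} request(s) to fill a 24-task buffer")
```

This only counts successful requests from one host. It says nothing about whether a cap would spread the first tasks more evenly across hosts, which is the case where the post concedes rationing could help.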
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
In msg 1 you said you had 14 tasks ready to upload, 5 of which had failed. Where did you get 20 from? If you'd actually had 20 it might well have been enough. Personally I keep 2 days' worth of 8-hour tasks too, which on my unattended Vista quad machine running 24/7 is around 24 tasks. I don't know how many cores you run (2?), but it doesn't seem to add up. Well, this is simple first grade Arithmetic. What I said was: "I have 14 completed tasks waiting. Of the 14, 5 failed on 'Computation Error[s],' and the others are 'Ready to Report.' I have two tasks running, and four more 'Ready to Start.'" I believe that if you add 14 + 2 + 4 you will get a total of 20. That is, unless you do Arithmetic differently on your planet ... Looking through this thread I see, in my absence, only one other person confirmed your problem and they found they had to restart Boinc locally to restore connectivity. There's a hint. I guess you assume (incorrectly) that EVERYBODY who experiences work outages posts on THIS thread. Not only is Arithmetic invalid on your planet, but it appears that Logic is also invalid. Do you actually think that there might be multiple versions of BOINC, and that my version might be different from yours? Q.E.D. Don't we all run the same basic version of BOINC and Rosetta that are automatically updated when necessary? Okay. I am running Version 5.10.28. Is there a newer/better version available for Windows? If you believe that I have somehow acquired a defective version of BOINC, you should say so, and you should tell me (and everybody else) how to correct it. Well, you seem to be focused exclusively on software and platforms in this thread. What else (of any value) do you have to contribute? It appears that you are making unwarranted assumptions, again. Maybe you should read the moderator's posts.
It is clear from posts made by Mod.Sense that the time interval during which any user might not receive work after recovery from a Rosetta server outage is purely a matter of chance. Do you dispute that assertion? Thanks for that. First, as I pointed out before, these are Boinc defaults, not Rosetta defaults (though I doubt that makes a lot of difference in itself tbh). More importantly, defaults in this situation are lowest common denominator across all projects, not tailored to meet every eventuality of this one - especially as a 3 hour runtime and a 0.25 buffer would hardly be recommended as a panacea for all Boinc projects by anyone. That's enough, but 3rd, as has been pointed out already, no project can guarantee uptime, so putting all eggs in one project's basket is going to result in a problem eventually. If Rosetta goes down for a month, we're all running out of Rosetta tasks, though I'll still be running 24/7 here. Your penchant for avoiding specific questions is reminiscent of a politician. Perhaps you should consider taking up politics (if you haven't already). Let's boil this down to basics. How do you explain a user with a two-day buffer running out of work for more than nine hours after Rosetta's servers fail for a period of only nine or ten hours? Mod.Sense explained it quite clearly by pointing out that it is a matter of CHANCE/LUCK/FORTUNE/KISMET/HAPPENSTANCE when additional work might be received, and my experience bears out that assertion. If you have an alternative answer that makes any sense at all, perhaps you would like to share it? If I have a two-day buffer for a dual-core processor, then I have 24 tasks in my buffer. This is in addition to the two tasks that are currently being processed. These two tasks will complete anywhere between one minute and four hours after the inception of the server outage, but never in less time. That means that I will have a MINIMUM of 48 hours of processing to complete before running out of work. 
With a two-day buffer, my machine should never even notice a 9-10 hour server outage. Would it be an improvement to limit the number of work units that were distributed to each user for some period of time after a server outage - say 24 hours? Why not ration the work until such time as the system has completely recovered from the outage. It's difficult to see how such an approach would not be more efficient. Well, isn't it the whole point of grid computing that a large number of computers should each be able to perform a small portion of the total work? Tell me how this goal is met if a number of computers receive no work at all while others receive more than they can possibly process within a short time? Users can select a buffer of as many as ten days. What is accomplished for the benefit of the project as a whole when some contributors sit idle waiting for work while other users have accumulated enough tasks to keep their machines busy for an additional ten days? Whether it happens to be the current BOINC/Rosetta design, or whether it is difficult to accomplish, is not the point. The point is that it is INEFFICIENT, and it does not speak well for the architecture of the software. Stop looking at the servers in a microcosm, and look at the entire grid as a whole. It might change your perspective. For the record, these weren't 'surplus' tasks - simply the rate my unattended quad machine completed tasks over that 36-hour period (plus refilling the buffer). No doubt I had some of the early 2.16 tasks that crashed on start-up among those. Well, if they're sitting in your buffer waiting to be processed, then they are (for some period of time) surplus. Your quad core is not infinite in its capacity. It can process only a certain number of tasks simultaneously. Any tasks that are waiting for current tasks to complete are, by definition, surplus.
How is it sane for some users to have extra tasks waiting in a buffer while other users are not able to obtain any tasks at all? This is the issue that nobody has been able to explain to me, or to anybody else who has experienced the same problem. Read my earlier posts. I have a broadband connection that is always available, and I have no access problems with any other Internet sources. You seem determined to rationalize your position that there is something wrong with my system and its settings, but you are unable to explain just exactly what that might be. A little more analysis, and a little less emotion might be useful. Ranting is not. Anyway, I'm sure everything's solved now. Until the inevitable next time... ;) If nothing has changed, why would you believe that everything is solved now? If there was a problem, and no repair was made, wouldn't it be irrational to expect that the problem was solved? How can you be so sure that, the next time Rosetta's servers fail, it won't be you who has to wait for nine or ten hours with no work? deesy |
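The buffer figures traded back and forth in this exchange can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes every task takes the same time and all cores stay busy, and the 4-hour runtime used for the dual-core case is inferred from the "between one minute and four hours" remark rather than stated outright.

```python
def tasks_in_buffer(cores: int, buffer_days: float, task_hours: float) -> float:
    """Tasks needed to keep `cores` busy for `buffer_days` at `task_hours` per task."""
    return cores * buffer_days * 24 / task_hours


def hours_until_idle(buffered_tasks: float, cores: int, task_hours: float) -> float:
    """Hours of work a buffer represents when all cores run in parallel."""
    return buffered_tasks * task_hours / cores


if __name__ == "__main__":
    # Quad core, 2-day buffer, 8-hour tasks (figures quoted earlier in the thread).
    print("quad core:", tasks_in_buffer(cores=4, buffer_days=2, task_hours=8), "tasks")
    # Dual core, 2-day buffer, 4-hour tasks (runtime is an inferred assumption).
    dual = tasks_in_buffer(cores=2, buffer_days=2, task_hours=4)
    print("dual core:", dual, "tasks")
    print("hours of work buffered:", hours_until_idle(dual, cores=2, task_hours=4))
```

Both machines come out at 24 buffered tasks, and the dual-core buffer represents 48 hours of work, which is why a full two-day buffer should outlast a 9 to 10 hour outage. The calculation only holds if the buffer was actually full when the outage began.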
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
From the first link in the quote above you will see the currently recommended version of BOINC for Windows 2000/XP/Vista/7 is 6.10.58. It comes in 32 bit and 64 bit versions. I cannot provide either positive or negative commentary on the latest version as my system is stable with version 6.4.5. I expect that I won't be upgrading until either my system starts to become unstable or I hear of a new feature that may be useful to me. |
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Oops! Typo! It is version 6.10.58. Sorry! deesy |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
I guess you assume (incorrectly) that EVERYBODY who experiences work outages posts on THIS thread. No, but you can list all the other threads reporting the same issue if you like, or the individual posts. Perhaps I missed them. It is clear from posts made by Mod.Sense that the time interval during which any user might not receive work after recovery from a Rosetta server outage is purely a matter of chance. Do you dispute that assertion? Certainly not. That's why I asked you whether your machine had actually asked for tasks and what the response was and when. You didn't reply, yet that's pretty crucial information if you actually wanted help. I asked How many times in the 14hr period (not more) before you received your first tasks did you get the reply "Internet access OK - project servers may be temporarily down" or "Scheduler request completed: got 0 new tasks"?. Too late now as my machine's rebooted and I can't compare the detailed answers with my own messages tab. A missed opportunity. Your penchant for avoiding specific questions is reminiscent of a politician. That's rich, considering the above. Let's boil this down to basics. How do you explain a user with a two-day buffer running out of work for more than nine hours after Rosetta's servers fail for a period of only nine or ten hours? I agree, it's inexplicable. But then, with the same 2-day buffer, I didn't run out so it's your question to answer, not mine. I have no complaints with the servers here. Mod.Sense explained it quite clearly by pointing out that it is a matter of CHANCE/LUCK/FORTUNE/KISMET/HAPPENSTANCE when additional work might be received, and my experience bears out that assertion. I'm not sure I understand this. 
It's down to Boinc when it calls for work, yes, and if there's an excess of calls on the server, it would be random whether you'd be served or someone else, yes, but from what I've written way up the thread, I connected successfully 16 times (none unsuccessful) before you managed to do so once. I tried to ask what kind of error message you got in your connection attempts and you didn't reply with any details, I tried to ask if you tried to connect at all, which is a possibility with Boinc, and you sort of replied, but not in a way that clarified anything (I think it was a yes, but no mention of what the report was in the messages tab), so then it's a matter of your connectivity for the reasons stated by the other person who re-started Boinc after pings failed and connected immediately after. You're suggesting I won a lottery 16 times successively and you didn't win once. Perhaps buying a ticket might help your chances (old gag - sorry). If you want people to help you, you have to help them help you. That's not happening. Stop looking at the servers in a microcosm, and look at the entire grid as a whole. It might change your perspective. This thread is about you (one node on a much bigger network) having a minor problem no-one else even mentioned. The 'perspective' comment is well made, but it doesn't need to be directed at me. Well, if they're sitting in your buffer waiting to be processed, then they are (for some period of time) surplus. Your quad cord is not infinite in its capacity. It can process only a certain number of tasks simultaneously. Any tasks that are waiting for current tasks to complete are, by definition, surplus. How is it sane for some users to have extra tasks waiting in a buffer while other users are not able to obtain any tasks at all? I thought you were complaining about a server problem, not a task shortage (of which there was none until a few hours yesterday which most people failed to notice at all). Which is it? 
There was no shortage of tasks for anyone who asked for them. The connection problem you had effectively means you didn't successfully ask for any (or ask for any at all). If you had, you'd have got them. It's sounding more like a local connectivity issue the more I read. Did you try restarting Boinc and/or your computer at any point when you saw the servers were reporting as up, or is this thread really all down to you blaming someone else before checking at your end? That would explain everything, wouldn't it? I asked if you restarted Boinc or re-booted. From the other person who perceived an issue, restarting Boinc solved their connectivity (Boinc to Rosetta) even though their other Internet connections seemed ok the whole while. For some reason it appears to be different. I'm not asking a different question, I'm asking this question. You may not like the question or think it inappropriate, but that's the question I'm asking. I'm not saying it would solve your problem, even though it solved the problem of the only other person who reported the same issue as you, but maybe it would. If it's really help you're asking for, presumably because you couldn't resolve the problem yourself, sometimes it's worth doing something you hadn't thought would work. So did you or didn't you restart Boinc and/or your computer? If nothing has changed, why would you believe that everything is solved now? Sense of humour failure at line 10. How can you be so sure that, the next time Rosetta's servers fail, it won't be you who has to wait for nine or ten hours with no work? It won't happen to me simply because I've made contingencies that you haven't. Plus, I have the magic key that lets me in and keeps you out. And apparently I have a very good record with the lottery. (I think that's everything). |
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
I guess you assume (incorrectly) that EVERYBODY who experiences work outages posts on THIS thread. Hmm. You really see only what you want to see, don't you. Why don't you take a look at the posts in the current "Houston, we have a problem ..." thread? There are, of course, other posts in other threads, but you probably don't want to look at them because they would not support your preconceived notions. Lame! deesy |
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Sid Celery's post is so inane that it doesn't deserve any more of a reply than it has already been given, or that I include here. BTW, my machine repeatedly asked for more work and was told that none was available, and that, perhaps, the servers were down. I am satisfied that Mod.Sense has adequately explained the way the system works. I think that it is not a very efficient system but, apparently, we must live with it. Interestingly, while the posters on the "Houston, we have a problem ..." thread have been experiencing a shortage of work very recently, I have not. Just lucky, I guess ... ;) deesy |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Deesy58 - I think that it is safe to say that while the end result was the same - a problem with getting new work units out to the community - the problems described in this thread and the "Houston we have a problem" thread were completely different in nature. Otherwise I would not have started a new thread. The problem described in the "Houston we have a problem" thread centered completely on the fact that the reservoir of available work units had dropped to zero for a period of 6 to 8 hours or so. Communications between the BOINC client and the project server were up and functional. The problem described in this thread seemed to center around a network issue in the project facility - and judging from the fact that when things came back up my browser could hit bakerlab.org but the BOINC client could not, it appeared that maybe the BOINC client cached the "old" IP address when it was brought up and that some time during all this that address changed. The results of ping, nslookup, and traceroute commands seem to support this. Whether it was a change in network configuration, or maybe DHCP got in the middle, I don't know for sure, and likely never will. But since, once I could hit the name server and resolve bakerlab.org again, BOINC still would not connect until after a restart, it is logical to assume that BOINC does indeed cache the address instead of doing a lookup each time. I admit that it is speculation, and that I don't have the facts to conclusively state that this is the exact scenario. However, I have had my hands deep in the bowels of many an IP stack and feel comfortable that this was a logical conclusion to draw. So because the problems were different in nature, your references to the other thread seem a little weak and unrelated to me. But that is just my two cents worth and I don't think either of us were made privy to the technical details by the project.
Have a good night and don't stress so much over past problems - I think in both these cases it is clear the problem was on the "project side" and not with your system, my system, or Sid's system. |
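The caching theory above is easy to picture with a short Python sketch. It illustrates the speculation only and does not describe how the BOINC client is actually written: the class names are hypothetical, the resolver is injected so the example runs without touching the network, and the addresses come from the reserved documentation ranges rather than the project's real servers.

```python
import socket
from typing import Callable

Resolver = Callable[[str], str]


class CachingClient:
    """Looks the host up once at start-up and reuses that address afterwards."""

    def __init__(self, host: str, resolve: Resolver = socket.gethostbyname) -> None:
        self.host = host
        self.address = resolve(host)  # cached until the client is restarted

    def target(self) -> str:
        return self.address


class FreshLookupClient:
    """Resolves the host name again before every connection attempt."""

    def __init__(self, host: str, resolve: Resolver = socket.gethostbyname) -> None:
        self.host = host
        self.resolve = resolve

    def target(self) -> str:
        return self.resolve(self.host)


if __name__ == "__main__":
    dns = {"bakerlab.org": "192.0.2.10"}  # stand-in DNS table, not real addresses
    lookup = dns.__getitem__

    cached = CachingClient("bakerlab.org", lookup)
    fresh = FreshLookupClient("bakerlab.org", lookup)

    dns["bakerlab.org"] = "198.51.100.20"  # the server's address changes

    print("caching client still targets:", cached.target())
    print("fresh-lookup client targets: ", fresh.target())
    print("caching client after restart:", CachingClient("bakerlab.org", lookup).target())
```

In this model a browser behaves like the fresh-lookup client and reaches the new address, while the caching client keeps trying the old one until it is restarted, which matches the symptoms reported in the post.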
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Deesy58 - The problem might have been different, but the symptoms appear to have been similar: My system has been out of Rosetta work for 8 hours but finally managed to get a new task about 30 minutes ago. This isn’t your OP, but it is a part of a post in your thread. Although this is part of but a single post, I am fairly certain that I recall other posts on other threads that have similarly reported a complete exhaustion of buffered work, followed by an idle period lasting for a number of hours before new work was received after Rosetta server outages. I will leave it to others, if they are sufficiently interested, to seek the specific posts that report idle time waiting for work. Why, for example, did this particular poster run out of work while nobody else appeared to do so at the same time? My machine was busy during the period that posts were made to your thread. Are you sure that you are not describing a distinction without a difference? The problem described in the "Houston we have a problem" thread centered completely on the fact that the reservoir of available work units had dropped to zero for a period of 6 to 8 hours or so. Communications between the BOINC client and the project server were up and functional. I am sure that your description is technically accurate. What portion of Rosetta contributors, however, do we imagine understand or care about specific server or IP protocol issues when they simply visit a “Server Status Page” that tells them all servers are up and running, but their computers are idle, waiting for work? The results of ping, nslookup, and traceroute commands seem to support this. Whether it was a change in network configuration, or maybe DHCP got in the middle, I don't know for sure, and likely never will. It appears that, regardless of whether boinc.org responds to the ping or nslookup commands, rosetta.org does not always respond without an error. When you were unable to resolve the name, I had no problems.
Earlier this evening I had no problems. Now, however, as I write this message (11:45 PM PDT on 10/20/2010) I receive the following message: "*** cdns2.cox.net can't find rosetta.org: Server failed". What conclusions can be drawn from this intermittent error message? But since, once I could hit the name server and resolve bakerlab.org again, BOINC still would not connect until after a restart, it is logical to assume that BOINC does indeed cache the address instead of doing a lookup each time. I’m not sure your speculation fully describes the nature of the problem[s]. If you are correct, why wouldn’t this issue affect all users during the time of the connection failure[s]? So because the problems were different in nature, your references to the other thread seem a little weak and unrelated to me. But that is just my two cents worth and I don't think either of us were made privy to the technical details by the project. I know what Mod.Sense explained. It fits with my experience, and it seems to fit with the experiences of some others. If, however, it is accurate, then it seems to me that the method of distributing work during the first (say) 24 hours after a server outage is really not optimal. It makes little sense that User "A" receives sufficient work to fill a ten-day buffer while User "B" sits idle for nine, ten or more hours waiting for work. Have a good night and don't stress so much over past problems - I think in both these cases it is clear the problem was on the "project side" and not with your system, my system, or Sid's system. I don’t think I am stressing very much at all. Also, if past problems are not solved, they have a penchant for becoming future problems - no? I have been trained throughout my career to identify problems, analyze them, and offer solutions or prompt others to offer solutions. Does anybody believe that this is a bad thing? I perceive a problem with the way work is distributed after a server outage.
I cannot be certain that there is a viable solution to that problem. Perhaps it is not solvable. But if it is possible to find a solution, then somebody should be looking for and proposing it. If nothing changes, nothing can improve. Thanks for your lucid and helpful contribution to the thread. deesy |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
As you are quoting me there I might as well respond. Though the "symptoms" may sound similar, it is indeed a completely different problem. The issues you have had relate to problems connecting to the server. The issue described in the other thread was that the work queue ran out of tasks for a short period of time. The Project team are usually quite good at refilling the work queue on time, so I would guess that they were just caught by surprise by the jump in project speed from the normal 100 Teraflops to the current 121 Teraflops. A 1/5 increase in speed probably emptied the queue a lot faster than they were expecting. You have implied that my system was idle while Rosetta was short of work, but that wasn't really the case. When my system couldn't get work from Rosetta it just downloaded an extra task from Poem@home and crunched on that for a while. Both projects aim to improve our understanding of proteins, so it doesn't matter to me which one is running. In answer to your question of why I ran out of Rosetta work and you didn't, it is simply a matter of buffer sizes. You mentioned above that you have a 2 day buffer so an 8 hour shortage of work would probably not have been noticed. However my system is powered down quite often, which can confuse BOINC's calculations on the size of buffer to maintain and lead to missed deadlines, so I keep the buffer at minimal levels. |
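The effect of the speed jump on the work queue can be put in numbers with a short Python sketch. The 100 and 121 teraflops figures come from the post above; the 24-hour planning horizon is a hypothetical value chosen only to make the example concrete, and the model assumes the queue drains in direct proportion to project speed.

```python
def relative_increase(old_rate: float, new_rate: float) -> float:
    """Fractional increase in throughput, e.g. 0.21 for a 21% rise."""
    return (new_rate - old_rate) / old_rate


def hours_to_drain(planned_hours: float, old_rate: float, new_rate: float) -> float:
    """How long a queue sized for `planned_hours` at `old_rate` lasts at `new_rate`."""
    return planned_hours * old_rate / new_rate


if __name__ == "__main__":
    print(f"increase: {relative_increase(100, 121):.0%}")
    print(f"a queue sized for 24 h lasts {hours_to_drain(24, 100, 121):.1f} h")
```

Under these assumptions a 21% rise in speed (roughly the 1/5 mentioned) empties a queue sized for a day about four hours early, enough to leave a gap if the refill schedule was not adjusted.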
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Okay, I understand that you were able to process for a different project. The point is, from the perspective of the Rosetta Project your machine was idle. Any of us who use BOINC and process for any of the projects that are managed by BOINC can process for a different project if the server systems of our primary project go down. It is not very logical to say that we were able to process for some other project, therefore, we were contributing to Rosetta. Folding@Home is also a protein research project. Could we say that our machines were contributing to the same type of research if we switched to FAH while the Rosetta servers were unable to supply work? Aren't the projects dissimilar in many respects? Is Poem@Home working on solutions to the very same problems as Rosetta? If so, wouldn't that be an unnecessary duplication of efforts and waste of resources that would make it difficult for Project management to obtain grant monies? I am truly sorry that I appear to be unable to make my point with sufficient clarity that everybody can understand it. I think I'll give up trying ... :( How big is your buffer? deesy |
©2024 University of Washington
https://www.bakerlab.org