Message boards : Number crunching : Stuck on uploading is a new problem?
Author | Message |
---|---|
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I am not sure how you are measuring that, but BoincTasks shows that my Ubuntu WU is stuck at 2.039% of 1569.45k (also for an rb ...), which comes out to 32.00k. Since that is half of 64k, it must mean something. I trust the appropriate expert will figure out what. |
Viking69 Send message Joined: 3 Oct 05 Posts: 20 Credit: 6,813,902 RAC: 3,404 |
|
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
SO WHAT TO DO? Option 1: Stay on Rosetta and hope that they fix it. Option 2: Go to another project and hope that they fix it. Option 3: ??? (Fill in the blank) EDIT: I am doing a little of each, keeping a couple of cores on Rosetta but adding TN-Grid to the mix. |
Matt Send message Joined: 7 Sep 10 Posts: 8 Credit: 1,240,825 RAC: 0 |
9 stuck uploads. BOINC no longer receiving tasks from Rosetta. |
TPCBF Send message Joined: 29 Nov 10 Posts: 111 Credit: 5,084,721 RAC: 1,942 |
Have one WU stuck on upload since at least Friday. Other Rosetta tasks seem to complete and upload fine. If this problem is so elusive, why aren't there any admins/programmers actively communicating with folks that have this problem in order to solve it? Ralf |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Have one WU stuck on upload since at least Friday. Other Rosetta tasks seem to complete and upload fine. We can see the errors in our logs. I don't know what additional info would help from users but if you have info that may help, please continue to post it here. The issue manifested at around 11:30 am on April 12. See Luki's (our systems engineer) notes: 1) The problem started on Wednesday 4/12 at 11:30 am PST. All web servers started misbehaving at the same time. Upload timeouts went up 100x (from ~80/day to ~8000/day). 2) Out of ~37000 unique client IPs that uploaded results this week, only ~1400 are affected (4%). So it's not random. 3) The network or machine loads have not really changed. 4) The size of the uploaded results is only ~600 KB, yet the upload stalls after ~8 KB (client stops sending data to server). Hence the upload_handler waits for more data until apache times out the request. The upload handler uses no CPU and causes no IO load (yet). 5) As you know, the web server nodes are directly connected to UW switches; our switches can't be to blame here. Still, I tried moving one of the public IPs (bsrv5) to another server, connected to another UW switch -- the problem moved with it instantaneously. 6) The culprit really seems to be at the network level, like the TCP ACKs don't make it to the client; yet we capture them on the wire. Is UW dropping them? Like they trigger an IDS? Luki We are still trying to figure out what is causing this and will keep you all posted if we make progress. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
I don't really want to volunteer and I don't have any sniffers set up right now (or so I claim), but that seems like something you can test fairly easily from any client on the outside. The way you describe it sounds like a filter is picking up certain ACKs and blocking them, so it seems you should be looking at the characteristics of the ACKs that should be going to the blocked packets in comparison to the ACKs that are making it. The bug seems to be in a can, which is the easy kind to pin down. (At least I have not yet noticed any stuck results getting unstuck, and the late result is apparently going to ignore its deadline.) Your [Luki's] packet size estimates don't seem to make sense to me, but all of my data definitely disagrees with the 8 K thing... I'm seeing 64 K (about 40 TCP/IP packets?) or variations around 3 times that, and someone else has reported a 32 K failure. In my own case, I estimate the blocked packets are only around 5%. Also I can say that enough uploads have succeeded from the Mac that I'm pretty sure it isn't affected, which may be another clue. By the way, I'm just taking the transfer progress from the "Transfers" tab in the BOINC client. Right now this machine shows 4 stuck results, with one showing 0.06 and the other three 64.00. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Luigi R. Send message Joined: 7 Feb 14 Posts: 39 Credit: 2,045,527 RAC: 0 |
A little update. One stuck task was miraculously uploaded yesterday. 16-Apr-2017 17:15:48 [rosetta@home] Started upload of 14dslfv5_14re4np_gb_0037_0001_30_0002_SAVE_ALL_OUT_480050_322_0_0 16-Apr-2017 17:16:02 [rosetta@home] Finished upload of 14dslfv5_14re4np_gb_0037_0001_30_0002_SAVE_ALL_OUT_480050_322_0_0 16-Apr-2017 17:16:06 [rosetta@home] Sending scheduler request: To report completed tasks. 16-Apr-2017 17:16:06 [rosetta@home] Reporting 1 completed tasks 16-Apr-2017 17:16:06 [rosetta@home] Not requesting tasks: don't need 16-Apr-2017 17:16:09 [rosetta@home] Scheduler request completed https://boinc.bakerlab.org/rosetta/result.php?resultid=910051019 Now I have 4 remaining tasks stuck in the uploading state. You can see that they break the "8 KB rule". https://s28.postimg.org/43l918fy5/rosetta_stuck_tasks.png Maybe 3 tasks stalled 4 times, so that's why the uploaded file size is 32 KB (= 4 * 8 KB) and not 8 KB. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Luki was referring to a specific job and was not speaking generally. The data size is variable. I think this issue is not just affecting uploads, when I tried my last post on this thread last night it timed out and the data was truncated. I had to manually remove the truncated post and repost. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Luki was referring to a specific job and was not speaking generally. The data size is variable. Sorry about my confusion, but I keep saying I don't want to spend much time on this as I keep spending time on it... I should have reread Luki's reply more carefully. However, I'm pretty sure that I have seen some stuck packets change their sticking points. That was a few days ago, but right now on the two machines at hand I see 6 stuck results, and all of them are at 64.00 or 0.06 (as reported for the larger results). If the lost ACK packets can be identified, perhaps they can be padded or unpadded differently? Something to change whatever characteristic the unknown filter is blocking them for? Ugly suggestion in the neighborhood of network internals, but sometimes that's where the spamming scammers (or worse) force us to go... By the way, I'm still not convinced it isn't Windows-10-specific. Yes, there was at least one report of a similar problem on Linux, but maybe it was different. I ran one of my Linux boxes all day yesterday without getting a sticker, and my Mac remains sticker free. Firing up a different Linux box today and will let it run all day (but that's including a major upgrade, which confuses all issues). One more thought... More generalized network problems affecting other connections? Naw, I don't think I want to go there. At least not yet. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
By the way, I'm still not convinced it isn't Windows-10-specific. Yes, there was at least one report of a similar problem on Linux, but maybe it was different. I ran one of my Linux boxes all day yesterday without getting a sticker, and my Mac remains sticker free. Firing up a different Linux box today and will let it run all day (but that's including a major upgrade, which confuses all issues). Today I updated my Ubuntu machine from 16.10 to 17.04 in the hope that it would unstick it, but it did not. However, three other Rosetta WUs have finished and uploaded on that PC, so it is not getting worse, and I will just let it run. Since it is quite unusual for Rosetta, and started at the same time as on my Win7 machine (no longer on Rosetta), it would appear to be the same problem to me. |
Luigi R. Send message Joined: 7 Feb 14 Posts: 39 Credit: 2,045,527 RAC: 0 |
By the way, I'm still not convinced it isn't Windows-10-specific. Yes, there was at least one report of a similar problem on Linux, but maybe it was different. I ran one of my Linux boxes all day yesterday without getting a sticker, and my Mac remains sticker free. Firing up a different Linux box today and will let it run all day (but that's including a major upgrade, which confuses all issues). My host (4 stuck tasks now) has Xubuntu 14.04.5, Linux kernel 3.13.0-116-generic. |
LarryMajor Send message Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0 |
I'm getting 1 WU that hangs for every 40-50 that process normally (under Linux). Just a thought - have you guys taken a WU known to have the problem, and tried running it on another host to see if anything looks unusual? |
TPCBF Send message Joined: 29 Nov 10 Posts: 111 Credit: 5,084,721 RAC: 1,942 |
As mentioned before, I doubt that this is something networking/IP related or protocol/IDS related, as WUs from the same host get uploaded just fine. And it is certainly not Windows 10 related, as someone else mentioned in his response, as in my case, the host in question is Windows 7 Pro/64 bit... And those 3 WUs on the other laptop that I mentioned now show up with a validation error, strangely enough... Ralf |
Keith Dale Send message Joined: 12 Apr 14 Posts: 1 Credit: 1,049,251 RAC: 0 |
I have two WUs that have been stuck uploading (one @ ~34% and the other @ 100%) for several days now. Other WUs have been uploaded and credit received with no problem. I'm running a Mac, btw - Sierra v10.12.4 Keith |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
Your [Luki's] packet size estimates don't seem to make sense to me, but all of my data definitely disagrees with the 8 K thing... I'm seeing 64 K (about 40 TCP/IP packets?) or variations around 3 times that, and someone else has reported a 32 K failure. Yes, all mine are a consistent and small size. 32k for me for what that information's worth. And all on Windows 7 machines. Lots of tasks still going through fine. But when one task has a problem it keeps on with that problem for days on end - hence my idea of a flag getting set somewhere. I upload 24 tasks/day. Total of 3 problems, so about 45 go through straight away. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Well, I'm pretty well convinced it is affecting all rosetta@home projects on all of the major platforms. Apparently I'm just lucky that I haven't seen one on my Mac or Linux boxen, but most of my machines are running Windows 10 most of the time. I have at least seven of the stuck-on-uploading results now. Still not convinced that it isn't something at the network level, however. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Luigi R. Send message Joined: 7 Feb 14 Posts: 39 Credit: 2,045,527 RAC: 0 |
Another (stuck) task got sent this night. 18-Apr-2017 04:05:28 [rosetta@home] Started upload of rb_03_23_72525_116778__t000__ab_robetta_IGNORE_THE_REST_474917_815_0_0 18-Apr-2017 04:05:46 [rosetta@home] Finished upload of rb_03_23_72525_116778__t000__ab_robetta_IGNORE_THE_REST_474917_815_0_0 18-Apr-2017 04:05:48 [rosetta@home] Sending scheduler request: To report completed tasks. 18-Apr-2017 04:05:48 [rosetta@home] Reporting 1 completed tasks 18-Apr-2017 04:05:48 [rosetta@home] Not requesting tasks: don't need 18-Apr-2017 04:05:51 [rosetta@home] Scheduler request completed https://boinc.bakerlab.org/rosetta/result.php?resultid=910050184 Why are some tasks uploaded successfully after days? Now I have 3 remaining (stuck) tasks. |
Dr. Merkwürdigliebe Send message Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
Di 18 Apr 2017 17:28:24 CEST | rosetta@home | Started upload of 12dslfv7_gb_0003_0005_11_0001_SAVE_ALL_OUT_480505_41_0_0 Doesn't work. 64,00 / 809,82 KB, then stops... |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Looks like it's been fixed, but I'm still curious what the problem was. My stuck-on-uploading results started clearing yesterday, and the first four machines I've checked are all good to go now. Anyone still having problems? #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
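
For anyone who wants to re-check the numbers quoted in this thread (the 32.00k stall point, the 100x jump in timeouts, the 4% of affected clients, and the "4 * 8 KB" guess), here is a minimal Python sketch. The function names are made up for illustration and are not part of BOINC or BoincTasks; the inputs are simply the figures posted above, and the script says nothing about what actually caused the stalls.

```python
# Sanity checks for figures quoted in the thread.
# All function names are hypothetical helpers, not BOINC APIs.

def stalled_kb(percent_done: float, total_kb: float) -> float:
    """Convert a transfer progress percentage into KB actually sent."""
    return percent_done / 100.0 * total_kb


def percent_done(sent_kb: float, total_kb: float) -> float:
    """Convert KB sent into a progress percentage."""
    return 100.0 * sent_kb / total_kb


def increase_factor(before_per_day: float, after_per_day: float) -> float:
    """How many times larger the daily count became."""
    return after_per_day / before_per_day


def affected_share(affected: int, total: int) -> float:
    """Share of affected clients, in percent."""
    return 100.0 * affected / total


if __name__ == "__main__":
    # Jim1348: 2.039% of 1569.45k
    print(f"stall point: {stalled_kb(2.039, 1569.45):.2f} KB")
    # Dr. Merkwuerdigliebe: 64.00 of 809.82 KB
    print(f"progress:    {percent_done(64.00, 809.82):.2f} %")
    # Luki: ~80/day to ~8000/day upload timeouts
    print(f"timeouts:    {increase_factor(80, 8000):.0f}x")
    # Luki: ~1400 of ~37000 client IPs
    print(f"affected:    {affected_share(1400, 37000):.1f} %")
    # Luigi R.: is 32 KB a whole multiple of 8 KB?
    print(f"32 KB / 8 KB = {32 / 8:.0f}")
```

Run as a script, it prints one line per quoted figure, so it is easy to swap in the sizes shown in your own "Transfers" tab.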
©2024 University of Washington
https://www.bakerlab.org