Stuck on uploading is a new problem?

Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 81462 - Posted: 16 Apr 2017, 21:19:06 UTC - in response to Message 81461.  


Just woke up another machine with a stuck unit. It was stuck at 64K, but after telling it to retry, the new stuck condition is 0.19/496.52 KB for an re12dslf... project.

Some kind of input buffer size problem? Maybe the rosetta@home people are trying to increase the buffer sizes for incoming data? The problem is somehow related to certain work units requesting smaller buffers than they actually need, and then getting stuck because they can't send the rest of their data?
Retried a few minutes later, and the new stuck status is:

0.06/1.62 MB
64.00/511.73 KB
64.00/219.68 KB
64.00/407.93 KB

I am not sure how you are measuring that, but BoincTasks shows that my Ubuntu WU is stuck at 2.039% of 1569.45k (also for an rb ...), which comes out to 32.00k. Since that is half of 64k, it must mean something.
I trust the appropriate expert will figure out what.
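The sizes quoted in this thread can be compared with a little arithmetic. This is only a sketch: it assumes the client's "KB" and "MB" are binary units (1 MB = 1024 KB), which nobody here has confirmed, and the helper names are made up for illustration.

```python
def percent_to_kb(percent: float, total_kb: float) -> float:
    """Convert a 'percent of total' progress reading into kilobytes sent."""
    return percent / 100.0 * total_kb


def kb_to_mb(kb: float) -> float:
    """Convert kilobytes to megabytes, assuming 1 MB = 1024 KB."""
    return kb / 1024.0


# BoincTasks reading quoted above: 2.039% of 1569.45k
sent_kb = percent_to_kb(2.039, 1569.45)
print(f"{sent_kb:.2f} KB sent")

# A 64 KB stall as it would appear on a transfer displayed in MB
stall_mb = kb_to_mb(64.0)
print(f"{stall_mb:.4f} MB, displayed as {stall_mb:.2f} MB")
```

Under that assumption, 64 KB is 0.0625 MB, so the "0.06/1.62 MB" entry may well be the same 64 KB stall point as the other three, just shown in a different unit.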

Viking69
Joined: 3 Oct 05
Posts: 20
Credit: 6,813,902
RAC: 3,404
Message 81463 - Posted: 16 Apr 2017, 21:34:42 UTC

I have one too:
wu

SO WHAT TO DO?
Hi all you enthusiastic crunchers.....
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 81464 - Posted: 16 Apr 2017, 22:38:43 UTC - in response to Message 81463.  
Last modified: 16 Apr 2017, 22:53:59 UTC

SO WHAT TO DO?


Option 1: Stay on Rosetta and hope that they fix it.

Option 2: Go to another project and hope that they fix it.

Option 3: ??? (Fill in the blank)

EDIT: I am doing a little of each, keeping a couple of cores on Rosetta but adding TN-Grid to the mix.
Matt

Joined: 7 Sep 10
Posts: 8
Credit: 1,240,825
RAC: 0
Message 81465 - Posted: 17 Apr 2017, 2:37:08 UTC

9 stuck uploads. BOINC no longer receiving tasks from Rosetta.
TPCBF

Joined: 29 Nov 10
Posts: 111
Credit: 5,084,721
RAC: 1,942
Message 81466 - Posted: 17 Apr 2017, 4:59:33 UTC - in response to Message 81465.  

Have one WU stuck on upload since at least Friday. Other Rosetta tasks seem to complete and upload fine.
If this problem is so elusive, why aren't there any admins/programmers actively communicating with folks that have this problem in order to solve it?

Ralf
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81471 - Posted: 17 Apr 2017, 6:21:15 UTC - in response to Message 81466.  

Have one WU stuck on upload since at least Friday. Other Rosetta tasks seem to complete and upload fine.
If this problem is so elusive, why aren't there any admins/programmers actively communicating with folks that have this problem in order to solve it?

Ralf


We can see the errors in our logs. I don't know what additional info would help from users but if you have info that may help, please continue to post it here.

The issue manifested at around 11:30 am on April 12. See Luki's (our systems engineer) notes:

1) The problem started on Wednesday 4/12 at 11:30 am PST. All web servers started misbehaving at the same time. Upload timeouts went up 100x (from ~80/day to ~8000/day).
2) Out of ~37000 unique client IPs that uploaded results this week, only ~1400 are affected (4%). So it's not random.
3) The network or machine loads have not really changed.
4) The size of the uploaded results is only ~600 KB, yet the upload stalls after ~8 KB (client stops sending data to server). Hence the upload_handler waits for more data until apache times out the request. The upload handler uses no CPU and causes no IO load (yet).
5) As you know, the web server nodes are directly connected to UW switches; our switches can't be to blame here. Still, I tried moving one of the public IPs (bsrv5) to another server, connected to another UW switch -- the problem moved with it instantaneously.
6) The culprit really seems to be at the network level, like the TCP ACKs don't make it to the client; yet we capture them on the wire. Is UW dropping them? Like they trigger an IDS?

Luki


We are still trying to figure out what is causing this and will keep you all posted if we make progress.
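The figures in Luki's notes can be sanity-checked in a few lines. This is a rough sketch that uses only the approximate numbers quoted above; the variable names are illustrative.

```python
# Approximate figures from Luki's notes
timeouts_before = 80      # upload timeouts per day before 4/12
timeouts_after = 8000     # upload timeouts per day after 4/12
unique_ips = 37000        # client IPs that uploaded results this week
affected_ips = 1400       # client IPs hitting the stall
result_kb = 600           # typical size of an uploaded result
stalled_kb = 8            # point where the observed upload stopped

increase = timeouts_after / timeouts_before
affected_share = affected_ips / unique_ips * 100
sent_share = stalled_kb / result_kb * 100

print(f"timeouts increased {increase:.0f}x")
print(f"{affected_share:.1f}% of client IPs affected")
print(f"upload stalled after {sent_share:.1f}% of the file")
```

The affected share works out to a little under 4%, matching the rounded figure in point 2, and the stall in point 4 happens after only about 1% of the file has been sent.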
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81472 - Posted: 17 Apr 2017, 7:07:50 UTC - in response to Message 81471.  


6) The culprit really seems to be at the network level, like the TCP ACKs don't make it to the client; yet we capture them on the wire. Is UW dropping them? Like they trigger an IDS?

Luki


We are still trying to figure out what is causing this and will keep you all posted if we make progress.


I don't really want to volunteer and I don't have any sniffers set up right now (or so I claim), but that seems like something you can test fairly easily from any client on the outside. The way you describe it sounds like a filter is picking up certain ACKs and blocking them, so it seems you should be looking at the characteristics of the ACKs that should be going to the blocked packets in comparison to the ACKs that are making it. The bug seems to be in a can, which is the easy kind to pin down. (At least I have not yet noticed any stuck results getting unstuck, and the late result is apparently going to ignore its deadline.)

Your [Luki's] packet size estimates don't seem to make sense to me, but all of my data definitely disagrees with the 8 K thing... I'm seeing 64 K (about 40 TCP/IP packets?) or variations around 3 times that, and someone else has reported a 32 K failure.
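The "about 40 packets" guess can be checked roughly. The sketch below assumes a maximum segment size of 1460 bytes (a 1500-byte Ethernet MTU minus 20 bytes each of IP and TCP header, no options); the real value on any given path may differ, and the function name is only illustrative.

```python
import math

# Assumption: 1500-byte MTU minus 20-byte IP and 20-byte TCP headers
MSS_BYTES = 1460


def segments_needed(size_bytes: int, mss: int = MSS_BYTES) -> int:
    """Number of full-or-partial TCP segments needed to carry size_bytes."""
    return math.ceil(size_bytes / mss)


# Stall points reported in this thread
for label, kb in [("8 KB", 8), ("32 KB", 32), ("64 KB", 64)]:
    print(f"{label}: {segments_needed(kb * 1024)} segments")
```

With that segment size, 64 KB comes to 45 segments, so the estimate of roughly 40 is in the right range.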

In my own case, I estimate the blocked packets are only around 5%. Also I can say that enough uploads have succeeded from the Mac that I'm pretty sure it isn't affected, which may be another clue.

By the way, I'm just taking the transfer progress from the "Transfers" tab in the BOINC client. Right now this machine shows 4 stuck results, with one showing 0.06 and the other three 64.00.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Luigi R.

Joined: 7 Feb 14
Posts: 39
Credit: 2,045,527
RAC: 0
Message 81473 - Posted: 17 Apr 2017, 15:38:40 UTC
Last modified: 17 Apr 2017, 15:46:18 UTC

A little update.

One stuck task was miraculously uploaded yesterday.

16-Apr-2017 17:15:48 [rosetta@home] Started upload of 14dslfv5_14re4np_gb_0037_0001_30_0002_SAVE_ALL_OUT_480050_322_0_0
16-Apr-2017 17:16:02 [rosetta@home] Finished upload of 14dslfv5_14re4np_gb_0037_0001_30_0002_SAVE_ALL_OUT_480050_322_0_0
16-Apr-2017 17:16:06 [rosetta@home] Sending scheduler request: To report completed tasks.
16-Apr-2017 17:16:06 [rosetta@home] Reporting 1 completed tasks
16-Apr-2017 17:16:06 [rosetta@home] Not requesting tasks: don't need
16-Apr-2017 17:16:09 [rosetta@home] Scheduler request completed
https://boinc.bakerlab.org/rosetta/result.php?resultid=910051019
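For anyone collecting the same evidence, here is a small sketch that pulls upload durations out of event-log lines like the ones above. It assumes the exact line layout shown here (date, time, project in brackets, message) and English month abbreviations; the function name is hypothetical and not part of BOINC.

```python
from datetime import datetime

LOG = """\
16-Apr-2017 17:15:48 [rosetta@home] Started upload of 14dslfv5_14re4np_gb_0037_0001_30_0002_SAVE_ALL_OUT_480050_322_0_0
16-Apr-2017 17:16:02 [rosetta@home] Finished upload of 14dslfv5_14re4np_gb_0037_0001_30_0002_SAVE_ALL_OUT_480050_322_0_0
16-Apr-2017 17:16:06 [rosetta@home] Sending scheduler request: To report completed tasks.
"""

STARTED = "Started upload of "
FINISHED = "Finished upload of "


def upload_durations(log_text: str) -> dict:
    """Map each file name to the seconds between its start and finish lines."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        parts = line.split(" ", 3)  # date, time, [project], message
        if len(parts) < 4:
            continue
        try:
            stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%d-%b-%Y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped log line
        message = parts[3]
        if message.startswith(STARTED):
            started[message[len(STARTED):]] = stamp
        elif message.startswith(FINISHED):
            name = message[len(FINISHED):]
            if name in started:
                durations[name] = (stamp - started.pop(name)).total_seconds()
    return durations


durations = upload_durations(LOG)
for name, seconds in durations.items():
    print(f"{name}: {seconds:.0f} s")
```

A start line with no matching finish line never makes it into the result, so a stuck upload simply does not appear in the output.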


Now I have 4 remaining tasks stuck in the uploading state.
You can see that they break the "8 KB rule".
https://s28.postimg.org/43l918fy5/rosetta_stuck_tasks.png
Maybe 3 of the tasks stalled 4 times each, which would explain why the uploaded size is 32 KB (= 4 * 8 KB) and not 8 KB.
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81474 - Posted: 17 Apr 2017, 17:38:00 UTC

Luki was referring to a specific job and was not speaking generally. The data size is variable.

I think this issue is not just affecting uploads. When I tried to submit my last post on this thread last night, it timed out and the data was truncated. I had to manually remove the truncated post and repost.
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81475 - Posted: 17 Apr 2017, 18:42:58 UTC - in response to Message 81474.  

Luki was referring to a specific job and was not speaking generally. The data size is variable.

I think this issue is not just affecting uploads. When I tried to submit my last post on this thread last night, it timed out and the data was truncated. I had to manually remove the truncated post and repost.


Sorry about my confusion, but I keep saying I don't want to spend much time on this as I keep spending time on it... I should have reread Luki's reply more carefully. However, I'm pretty sure that I have seen some stuck packets change their sticking points. That was a few days ago, but right now on the two machines at hand I see 6 stuck results, and all of them are at 64.00 or 0.06 (as reported for the larger results).

If the lost ACK packets can be identified, perhaps they can be padded or unpadded differently? Something to change whatever characteristic the unknown filter is blocking them for? Ugly suggestion in the neighborhood of network internals, but sometimes that's where the spamming scammers (or worse) force us to go...

By the way, I'm still not convinced it isn't Windows-10-specific. Yes, there was at least one report of a similar problem on Linux, but maybe it was different. I ran one of my Linux boxes all day yesterday without getting a sticker, and my Mac remains sticker free. Firing up a different Linux box today and will let it run all day (but that's including a major upgrade, which confuses all issues).

One more thought... More generalized network problems affecting other connections? Naw, I don't think I want to go there. At least not yet.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 81477 - Posted: 17 Apr 2017, 19:21:26 UTC - in response to Message 81475.  

By the way, I'm still not convinced it isn't Windows-10-specific. Yes, there was at least one report of a similar problem on Linux, but maybe it was different. I ran one of my Linux boxes all day yesterday without getting a sticker, and my Mac remains sticker free. Firing up a different Linux box today and will let it run all day (but that's including a major upgrade, which confuses all issues).

Today I updated my Ubuntu machine from 16.10 to 17.04 in the hope that it would unstick it, but it did not. However, three other Rosetta WUs have finished and uploaded on that PC, so it is not getting worse, and I will just let it run. Since it is quite unusual for Rosetta, and started at the same time as on my Win7 machine (no longer on Rosetta), it would appear to be the same problem to me.


Luigi R.

Joined: 7 Feb 14
Posts: 39
Credit: 2,045,527
RAC: 0
Message 81479 - Posted: 17 Apr 2017, 22:09:46 UTC - in response to Message 81475.  

By the way, I'm still not convinced it isn't Windows-10-specific. Yes, there was at least one report of a similar problem on Linux, but maybe it was different. I ran one of my Linux boxes all day yesterday without getting a sticker, and my Mac remains sticker free. Firing up a different Linux box today and will let it run all day (but that's including a major upgrade, which confuses all issues).

My host (4 stuck tasks now) has Xubuntu 14.04.5, Linux kernel 3.13.0-116-generic.
LarryMajor

Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 81481 - Posted: 17 Apr 2017, 22:15:05 UTC

I'm getting 1 WU that hangs for every 40-50 that process normally (under Linux).

Just a thought - have you guys taken a WU known to have the problem, and tried running it on another host to see if anything looks unusual?
TPCBF

Joined: 29 Nov 10
Posts: 111
Credit: 5,084,721
RAC: 1,942
Message 81482 - Posted: 17 Apr 2017, 22:47:18 UTC

As mentioned before, I doubt that this is something networking/IP related or protocol/IDS related, as WUs from the same host get uploaded just fine.

And it is certainly not Windows 10 related, as someone else mentioned in his response, as in my case, the host in question is Windows 7 Pro/64 bit...

And those 3 WUs on the other laptop that I mentioned now show up with a validation error, strangely enough...

Ralf
Keith Dale

Joined: 12 Apr 14
Posts: 1
Credit: 1,049,251
RAC: 0
Message 81483 - Posted: 17 Apr 2017, 23:02:43 UTC
Last modified: 17 Apr 2017, 23:06:22 UTC

I have two WUs that have been stuck uploading (one @ ~34% and the other @ 100%) for several days now. Other WUs have been uploaded and credit received with no problem.

I'm running a Mac, btw - Sierra v10.12.4

Keith
Sid Celery

Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 10,982
Message 81484 - Posted: 18 Apr 2017, 2:35:46 UTC - in response to Message 81472.  
Last modified: 18 Apr 2017, 2:37:50 UTC

Your [Luki's] packet size estimates don't seem to make sense to me, but all of my data definitely disagrees with the 8 K thing... I'm seeing 64 K (about 40 TCP/IP packets?) or variations around 3 times that, and someone else has reported a 32 K failure.

Yes, all mine are a consistent and small size. 32k for me for what that information's worth. And all on Windows 7 machines.

Lots of tasks still going through fine. But when one task has a problem it keeps on with that problem for days on end - hence my idea of a flag getting set somewhere. I upload 24 tasks/day. Total of 3 problems, so about 45 go through straight away.
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81487 - Posted: 18 Apr 2017, 5:26:31 UTC

Well, I'm pretty well convinced it is affecting all rosetta@home projects on all of the major platforms. Apparently I'm just lucky that I haven't seen one on my Mac or Linux boxen, but most of my machines are running Windows 10 most of the time. I have at least seven of the stuck-on-uploading results now.

Still not convinced that it isn't something at the network level, however.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Luigi R.

Joined: 7 Feb 14
Posts: 39
Credit: 2,045,527
RAC: 0
Message 81488 - Posted: 18 Apr 2017, 8:41:08 UTC
Last modified: 18 Apr 2017, 8:42:00 UTC

Another (stuck) task was uploaded last night.

18-Apr-2017 04:05:28 [rosetta@home] Started upload of rb_03_23_72525_116778__t000__ab_robetta_IGNORE_THE_REST_474917_815_0_0
18-Apr-2017 04:05:46 [rosetta@home] Finished upload of rb_03_23_72525_116778__t000__ab_robetta_IGNORE_THE_REST_474917_815_0_0
18-Apr-2017 04:05:48 [rosetta@home] Sending scheduler request: To report completed tasks.
18-Apr-2017 04:05:48 [rosetta@home] Reporting 1 completed tasks
18-Apr-2017 04:05:48 [rosetta@home] Not requesting tasks: don't need
18-Apr-2017 04:05:51 [rosetta@home] Scheduler request completed
https://boinc.bakerlab.org/rosetta/result.php?resultid=910050184


Why are some tasks uploaded successfully after days?


Now I have 3 remaining (stuck) tasks.
Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 81490 - Posted: 18 Apr 2017, 15:30:14 UTC

Tue 18 Apr 2017 17:28:24 CEST | rosetta@home | Started upload of 12dslfv7_gb_0003_0005_11_0001_SAVE_ALL_OUT_480505_41_0_0

Doesn't work. It gets to 64.00 / 809.82 KB, then stops...
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81492 - Posted: 18 Apr 2017, 19:25:34 UTC

Looks like it's been fixed, but I'm still curious what the problem was. My stuck-on-uploading results started clearing yesterday, and the first four machines I've checked are all good to go now.

Anyone still having problems?
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)


