Total posts in this thread: 6
This topic has been viewed 576 times and has 5 replies
wplachy
Senior Cruncher
Joined: Sep 4, 2007
Post Count: 423
Status: Offline
Didn't resend lost task Message Question

The 2 devices I have that are using app_info.xml (HCC-ATI 7.05) have started reporting "No Reply" for a few HCC-ATI 7.05 tasks. The Event Log shows the message:

Didn't resend lost task X0930078111036200611211630_2 (expired)

The Workunit Status shows the tasks were sent and then set to No Reply a few minutes later and the only reference in the Event log is the "expired" message. I've seen this occur in the past when the server sent work that was not defined in the app_info.xml but that is not the case with these.

The only thing I see in common for the 5 I have looked at is that they are all repair tasks, but both devices are receiving and processing HCC-ATI 7.05 repair tasks.

While these are a small percentage of the WUs processed by these devices, it does seem rather odd that it is happening, and it raises two questions: is anyone else seeing this, and does anyone know why it's happening?

Workunit Status
---------------
X0930078111036200611211630_ 3-- - In Progress 11/24/12 18:14:17 11/27/12 13:26:17 0.00 0.0 / 0.0
X0930078111036200611211630_ 2-- - No Reply 11/24/12 18:11:27 11/24/12 18:14:11 0.00 0.0 / 0.0 <-Mine
X0930078111036200611211630_ 1-- 705 Error 11/24/12 14:11:29 11/24/12 18:11:17 0.00 62.6 / 0.0
X0930078111036200611211630_ 0-- - In Progress 11/24/12 14:11:27 12/1/12 14:11:27 0.00 0.0 / 0.0

X0930078111040200611211630_ 3-- - In Progress 11/24/12 18:14:19 11/27/12 13:26:19 0.00 0.0 / 0.0
X0930078111040200611211630_ 2-- - No Reply 11/24/12 18:11:27 11/24/12 18:14:11 0.00 0.0 / 0.0 <-Mine
X0930078111040200611211630_ 1-- 705 Error 11/24/12 14:11:30 11/24/12 18:11:17 0.00 62.6 / 0.0
X0930078111040200611211630_ 0-- - In Progress 11/24/12 14:11:27 12/1/12 14:11:27 0.00 0.0 / 0.0

X0930078111047200611211630_ 3-- - In Progress 11/24/12 18:14:21 11/27/12 13:26:21 0.00 0.0 / 0.0
X0930078111047200611211630_ 2-- - No Reply 11/24/12 18:11:27 11/24/12 18:14:11 0.00 0.0 / 0.0 <-Mine
X0930078111047200611211630_ 1-- 705 Error 11/24/12 14:11:29 11/24/12 18:11:17 0.00 62.6 / 0.0
X0930078111047200611211630_ 0-- - In Progress 11/24/12 14:11:27 12/1/12 14:11:27 0.00 0.0 / 0.0

X0930078111037200611211630_ 3-- 705 Pending Validation 11/24/12 18:14:18 11/24/12 18:27:56 0.11 72.9 / 0.0
X0930078111037200611211630_ 2-- - No Reply 11/24/12 18:11:27 11/24/12 18:14:11 0.00 0.0 / 0.0 <-Mine
X0930078111037200611211630_ 1-- 705 Error 11/24/12 14:11:29 11/24/12 18:11:17 0.00 62.6 / 0.0
X0930078111037200611211630_ 0-- - In Progress 11/24/12 14:11:27 12/1/12 14:11:27 0.00 0.0 / 0.0

X0930078111043200611211630_ 3-- - In Progress 11/24/12 18:14:21 11/27/12 13:26:21 0.00 0.0 / 0.0
X0930078111043200611211630_ 2-- - No Reply 11/24/12 18:11:27 11/24/12 18:14:11 0.00 0.0 / 0.0 <-Mine
X0930078111043200611211630_ 1-- 705 Error 11/24/12 14:11:29 11/24/12 18:11:17 0.00 62.6 / 0.0
X0930078111043200611211630_ 0-- 705 Pending Validation 11/24/12 14:11:27 11/24/12 14:53:07 0.08 66.6 / 0.0
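
To put a number on "a few minutes later", here is a minimal Python sketch that parses rows in the layout pasted above and reports how long each No Reply copy stayed assigned. The regular expression and the `parse_row` helper are assumptions based only on these pasted rows, not on any official Workunit Status export format.

```python
import re
from datetime import datetime

# Assumed row layout, taken from the rows pasted in this post:
# <name>_ <copy>-- <app version or "-"> <status> <sent> <due/returned> ...
ROW = re.compile(
    r"^(?P<name>\S+_)\s+(?P<copy>\d+)--\s+(?P<app>\S+)\s+(?P<status>.+?)\s+"
    r"(?P<sent>\d+/\d+/\d+ \d+:\d+:\d+)\s+(?P<due>\d+/\d+/\d+ \d+:\d+:\d+)"
)
FMT = "%m/%d/%y %H:%M:%S"


def parse_row(line):
    """Return a dict for one status row, or None if the line doesn't match."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    sent = datetime.strptime(m["sent"], FMT)
    due = datetime.strptime(m["due"], FMT)
    return {
        "task": m["name"] + m["copy"],
        "status": m["status"],
        "seconds": (due - sent).total_seconds(),
    }


rows = [
    "X0930078111036200611211630_ 2-- - No Reply 11/24/12 18:11:27 11/24/12 18:14:11 0.00 0.0 / 0.0<-Mine",
    "X0930078111036200611211630_ 1-- 705 Error 11/24/12 14:11:29 11/24/12 18:11:17 0.00 62.6 / 0.0",
]

parsed = [r for r in map(parse_row, rows) if r is not None]
no_reply = [r for r in parsed if r["status"] == "No Reply"]
for r in no_reply:
    print(r["task"], "was marked No Reply after", r["seconds"], "seconds")
```

For the copy marked "Mine" this gives 164 seconds, i.e. under three minutes between being sent and being set to No Reply.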

Event Log Snip
--------------
24-Nov-2012 12:14:08 [World Community Grid] [sched_op] Starting scheduler request
24-Nov-2012 12:14:08 [World Community Grid] Sending scheduler request: To report completed tasks.
24-Nov-2012 12:14:08 [World Community Grid] Reporting 10 completed tasks
24-Nov-2012 12:14:08 [World Community Grid] Requesting new tasks for ATI
24-Nov-2012 12:14:08 [World Community Grid] [sched_op] CPU work request: 0.00 seconds; 0.00 devices
24-Nov-2012 12:14:08 [World Community Grid] [sched_op] ATI work request: 2563.04 seconds; 0.00 devices
24-Nov-2012 12:14:12 [World Community Grid] Scheduler request completed: got 5 new tasks
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] Server version 701
24-Nov-2012 12:14:12 [World Community Grid] Didn't resend lost task X0930078111036200611211630_2 (expired)
24-Nov-2012 12:14:12 [World Community Grid] Didn't resend lost task X0930078111040200611211630_2 (expired)
24-Nov-2012 12:14:12 [World Community Grid] Didn't resend lost task X0930078111047200611211630_2 (expired)
24-Nov-2012 12:14:12 [World Community Grid] Didn't resend lost task X0930078111037200611211630_2 (expired)
24-Nov-2012 12:14:12 [World Community Grid] Didn't resend lost task X0930078111043200611211630_2 (expired)

24-Nov-2012 12:14:12 [World Community Grid] Project requested delay of 11 seconds
24-Nov-2012 12:14:12 [World Community Grid] App version returned from anonymous platform project; ignoring
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] estimated total CPU task duration: 0 seconds
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] estimated total ATI task duration: 23464 seconds
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0960078621347200611221015_0
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0900078140171200611072136_0
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0900078140148200611072135_0
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0960078590977200611212253_2
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0900078650018200611080837_2
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0900078650222200611080834_2
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0900078650220200611080834_2
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0930078121060200611211656_0
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0930078121095200611211655_0
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] handle_scheduler_reply(): got ack for task X0900078131518200610310928_2
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] Deferring communication for 11 sec
24-Nov-2012 12:14:12 [World Community Grid] [sched_op] Reason: requested by project
----------------------------------------
Bill P

[Nov 24, 2012 8:09:06 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Didn't resend lost task Message Question

Is the version correct, i.e. it's not a 6.56/7.05 mix-up (which I thought the techs had fixed by renumbering the required app version)?
[Nov 25, 2012 9:25:05 AM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline
Re: Didn't resend lost task Message Question

While these are a small percentage of the WUs processed by these devices, it does seem rather odd that it is happening, and it raises two questions: is anyone else seeing this, and does anyone know why it's happening?

Due to a bug in the scheduling server, your client can be assigned tasks for an application or app version it can't run because it is using app_info.xml. Such tasks are immediately rejected by the client, but the rejection isn't reported back to the server. If such a task were re-issued, it would be re-issued in a (nearly) infinite loop and the client wouldn't get any other tasks.

Since the scheduling server can't tell the difference between assigned tasks being rejected by the client and scheduler replies not making it to the client, for the moment, instead of trying to re-issue them, the tasks are marked as "no reply" so they can be sent to another computer.

Once the scheduler bug has been fixed, normal re-issue should be enabled again and you shouldn't see these messages any longer. How long it will take to fix the bug is unclear at this point.
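
As a rough illustration of the behaviour described above, here is a toy Python model of the choice the server makes for tasks the client no longer has. This is not the BOINC scheduler code; the function name, the arguments, and the `resend_enabled` switch are all hypothetical and only mirror the description in this post.

```python
def handle_lost_tasks(assigned, held_by_client, resend_enabled):
    """Toy model: decide what happens to tasks the client no longer lists.

    assigned        -- task names the server believes are on the host
    held_by_client  -- task names the client actually has
    resend_enabled  -- hypothetical switch: True = normal re-issue,
                       False = the temporary workaround
    Returns (resent, marked_no_reply).
    """
    lost = sorted(set(assigned) - set(held_by_client))
    if resend_enabled:
        # Normal behaviour: send the same tasks to the same client again.
        return lost, []
    # Workaround: don't resend; mark "no reply" so another host can get them.
    return [], lost


assigned = ["X0930078111036200611211630_2", "X0960078621347200611221015_0"]
held = ["X0960078621347200611221015_0"]

print(handle_lost_tasks(assigned, held, resend_enabled=False))
```

With the workaround active, the missing task ends up in the second list, which corresponds to the "Didn't resend lost task ... (expired)" message on the client and the "No Reply" status on the web page.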
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Nov 25, 2012 12:56:58 PM]
wplachy
Senior Cruncher
Joined: Sep 4, 2007
Post Count: 423
Status: Offline
Re: Didn't resend lost task Message Question

Rob and Ingleside, thank you for the responses.
Is the version correct, i.e. it's not a 6.56/7.05 mix-up (which I thought the techs had fixed by renumbering the required app version)?

Yes, all the ones I've checked, including 14 additional ones yesterday, are 7.05.

...Since the scheduling server can't tell the difference between assigned tasks being rejected by the client and scheduler replies not making it to the client, for the moment, instead of trying to re-issue them, the tasks are marked as "no reply" so they can be sent to another computer....

I understand the problem of being sent WUs that are not configured in the app_info.xml and the workaround put in place to compensate for the server bug.

It appears my question was poorly phrased. The expired WUs are for HCC-ATI 7.05. The app_info.xml files are for HCC-ATI 7.05. Both devices are completing and validating thousands of HCC-ATI 7.05 WUs daily.
Given that, my question is: why just these?
----------------------------------------
Bill P

[Nov 25, 2012 5:10:45 PM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline
Re: Didn't resend lost task Message Question

...Since the scheduling server can't tell the difference between assigned tasks being rejected by the client and scheduler replies not making it to the client

It appears my question was poorly phrased. The expired WUs are for HCC-ATI 7.05. The app_info.xml files are for HCC-ATI 7.05. Both devices are completing and validating thousands of HCC-ATI 7.05 WUs daily.
Given that, my question is: why just these?

A re-issue happens if, for any of many possible reasons, the client is missing one or more tasks that have been assigned to it. This is commonly due to a scheduler reply never making it to the client.

If you're using app_info.xml with only HCC-ATI and also have only HCC selected on the web page as a possible project, any re-issues will be for HCC-ATI.

This means that if you've had a "scheduler reply timed out" or "scheduler reply corrupt" error or something similar, any re-issues on the next scheduler request will be for HCC-ATI 7.05. As I've already mentioned, due to the bug the scheduling server doesn't know that HCC-ATI 7.05 is the application you've specified in your app_info.xml file, so these tasks will instead be marked "no reply".

So check your log to see whether you had a connection error a few minutes before the posted log snippet.
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
----------------------------------------
[Edit 1 times, last edit by Ingleside at Nov 25, 2012 6:23:45 PM]
[Nov 25, 2012 6:19:46 PM]
wplachy
Senior Cruncher
Joined: Sep 4, 2007
Post Count: 423
Status: Offline
Re: Didn't resend lost task Message Question

So check your log to see whether you had a connection error a few minutes before the posted log snippet.

Thank you Ingleside! I reviewed the Event Logs and that appears to be the answer. In all the cases I looked at, there was a sequence of communication failures followed by expired messages. The WU Sent/Return time stamps are within a few seconds of the comm-failure and expired Event Log time stamps; I'm sure the remaining differences come from the device and server clocks not being in sync.
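
The "within a few seconds" comparison can be sketched in Python using two values from my first post: the Return time of 18:14:11 on the Workunit Status page and the 12:14:12 Event Log time of the matching expired message. The helper name `split_offset` is hypothetical, and treating the whole-hour part of the gap as a time zone difference between the web page and the device is an assumption.

```python
from datetime import datetime


def split_offset(server_time, client_time):
    """Split the gap between two time stamps into whole hours plus a remainder.

    Assumption: the whole-hour part is a time zone difference and only the
    remainder reflects clock drift or processing delay.
    """
    delta = (server_time - client_time).total_seconds()
    hours = round(delta / 3600)
    residual = delta - hours * 3600
    return hours, residual


# Values copied from the first post in this thread.
server = datetime(2012, 11, 24, 18, 14, 11)  # Return time on the status page
client = datetime(2012, 11, 24, 12, 14, 12)  # "Didn't resend lost task" log line

print(split_offset(server, client))  # (6, -1.0)
```

So after removing the six-hour offset, the two time stamps are one second apart.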

Extracted Event Log
-------------------
24-Nov-2012 21:29:41 [World Community Grid] [sched_op] Starting scheduler request
24-Nov-2012 21:29:41 [World Community Grid] Sending scheduler request: To report completed tasks.
24-Nov-2012 21:29:41 [World Community Grid] Reporting 3 completed tasks
24-Nov-2012 21:29:41 [World Community Grid] Requesting new tasks for ATI
24-Nov-2012 21:29:41 [World Community Grid] [sched_op] CPU work request: 0.00 seconds; 0.00 devices
24-Nov-2012 21:29:41 [World Community Grid] [sched_op] ATI work request: 1665.78 seconds; 0.00 devices
24-Nov-2012 21:30:14 [World Community Grid] Scheduler request failed: Transferred a partial file
24-Nov-2012 21:30:14 [World Community Grid] [sched_op] Deferring communication for 3 min 54 sec
24-Nov-2012 21:30:14 [World Community Grid] [sched_op] Reason: Scheduler request failed
24-Nov-2012 21:30:18 [---] Project communication failed: attempting access to reference site
24-Nov-2012 21:30:20 [---] Internet access OK - project servers may be temporarily down.
....
.... Omitted lines contain no scheduler requests, only task completions and upload Started/Finished messages
....
24-Nov-2012 21:34:15 [World Community Grid] [sched_op] Starting scheduler request
24-Nov-2012 21:34:15 [World Community Grid] Sending scheduler request: To report completed tasks.
24-Nov-2012 21:34:15 [World Community Grid] Reporting 11 completed tasks
24-Nov-2012 21:34:15 [World Community Grid] Requesting new tasks for ATI
24-Nov-2012 21:34:15 [World Community Grid] [sched_op] CPU work request: 0.00 seconds; 0.00 devices
24-Nov-2012 21:34:15 [World Community Grid] [sched_op] ATI work request: 3615.45 seconds; 0.00 devices
24-Nov-2012 21:34:32 [World Community Grid] Scheduler request completed: got 8 new tasks
24-Nov-2012 21:34:32 [World Community Grid] [sched_op] Server version 701
24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0900078740077200611221313_2 (expired)
24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0930078600206200611071307_2 (expired)
24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0930078600388200611071304_3 (expired)
24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0930078600392200611071304_3 (expired)
24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0930078600419200611071303_2 (expired)
24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0930078600433200611071302_3 (expired)
24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0930078600387200611071304_3 (expired)
24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0930078600434200611071302_3 (expired)
24-Nov-2012 21:34:32 [World Community Grid] Project requested delay of 11 seconds
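
For anyone who wants to run the same check on their own log, here is a minimal Python sketch that looks for "Didn't resend lost task" messages arriving shortly after a "Scheduler request failed" message. The line pattern is inferred from the Event Log lines pasted above; the function names and the 10-minute window are assumptions, not anything defined by BOINC.

```python
import re
from datetime import datetime, timedelta

# Assumed Event Log layout: "24-Nov-2012 21:30:14 [Project] message".
# %b (month abbreviation) is locale dependent; English names are assumed.
LINE = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def parse(line):
    """Return (timestamp, message) or None for lines that don't match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S"), m.group(3)


def lost_after_failure(lines, window=timedelta(minutes=10)):
    """List (task, delay) for lost tasks reported soon after a failed request."""
    last_failure = None
    hits = []
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        when, msg = parsed
        if msg.startswith("Scheduler request failed"):
            last_failure = when
        elif msg.startswith("Didn't resend lost task"):
            task = msg.split()[4]  # fifth word is the task name
            if last_failure is not None and when - last_failure <= window:
                hits.append((task, when - last_failure))
    return hits


log = [
    "24-Nov-2012 21:30:14 [World Community Grid] Scheduler request failed: Transferred a partial file",
    "24-Nov-2012 21:34:15 [World Community Grid] [sched_op] Starting scheduler request",
    "24-Nov-2012 21:34:32 [World Community Grid] Didn't resend lost task X0900078740077200611221313_2 (expired)",
]

for task, delay in lost_after_failure(log):
    print(task, "reported lost", int(delay.total_seconds()), "seconds after a failed request")
```

On the three sample lines this flags the one expired task 258 seconds after the failed request, which matches the failure-then-expired sequence in the extract above.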
----------------------------------------
Bill P

[Nov 25, 2012 7:25:52 PM]