World Community Grid - View Thread

Thread: Lost result.


Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Lost result.

I had lines in my stdoutdae.txt that read:
1] Message from server: Completed result DSFL_00000110_0000034_0116_1 refused: result already reported as success
2] Message from server: Resent lost result DSFL_00000114_0000046_0405_1

Formatting the messages as:
1] Message from server: Completed result WU-refused refused: result already reported as success
2] Message from server: Resent lost result WU-resent

The WU-uploaded list contains the same WUs as the WU-refused list, and the WUs in the WU-refused list are not the same as those in the WU-resent list.
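The comparison described above can be sketched in Python. This is only an illustration: the two patterns are copied from the server messages quoted in this post, while the function name `classify` and the idea of feeding it lines from stdoutdae.txt are assumptions of the sketch, not part of BOINC.

```python
import re

# Patterns copied from the two server messages quoted in this post.
REFUSED = re.compile(r"Completed result (\S+) refused: result already reported as success")
RESENT = re.compile(r"Resent lost result (\S+)")


def classify(lines):
    """Return (refused, resent) sets of result names found in the log lines."""
    refused, resent = set(), set()
    for line in lines:
        if (m := REFUSED.search(line)):
            refused.add(m.group(1))
        elif (m := RESENT.search(line)):
            resent.add(m.group(1))
    return refused, resent


sample = [
    "Message from server: Completed result DSFL_00000110_0000034_0116_1 "
    "refused: result already reported as success",
    "Message from server: Resent lost result DSFL_00000114_0000046_0405_1",
]
refused, resent = classify(sample)
overlap = refused & resent  # empty when the two lists name different WUs
```

An empty `overlap` corresponds to the observation that the refused WUs and the resent WUs are different sets.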

The WUs in the WU-refused list are ones I had uploaded minutes earlier (WU-uploaded), between 06:42:32 and 06:43:25 on 06-Feb-2012 to be exact. The "Ready to report" status did not get cleared in the usual timely fashion. I got impatient and clicked the "Update" button, as the following line indicates:
06-Feb-2012 06:48:00 [World Community Grid] update requested by user

The scheduler may have been busy, as the following lines suggest:
06-Feb-2012 06:49:06 [---] Project communication failed: attempting access to reference site
06-Feb-2012 06:49:06 [World Community Grid] Scheduler request failed: Timeout was reached
06-Feb-2012 06:49:08 [---] Internet access OK - project servers may be temporarily down.

My queries:
1] Will my uploaded WUs be handled the usual way?
2] What makes for a "lost result"?
3] Why did I get a "lost result"?
4] How will a "lost result" be handled during validation?
;
; ------------------------------------
; edits:
; 1 & 2 > spell/grammar check
; 3 > added absolute time to complement relative time
;

----------------------------------------
[Edit 3 times, last edit by Former Member at Feb 6, 2012 10:52:00 PM]

[Feb 6, 2012 10:13:23 PM]

Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline

Re: Lost result.

My queries:
1] Will my uploaded WUs be handled the usual way?

Yes.

2] What makes for a "lost result"?
3] Why did I get a "lost result"?
4] How will a "lost result" be handled during validation?

#2, #3, #4: A "lost result", also commonly called a "ghost-wu" or "ghost-task", happens when the client sends a scheduling-message to the server asking for work, the server accepts this request and finds some work that it assigns to the client, but for some reason the client never gets the response from the server about the new work.

When the client some time later asks again for more work, the server detects that the client is "missing" some tasks, and therefore re-issues these. Normally this 2nd. attempt is successful, and the client gets the new work, and any later handling on client and server, including validation, is in these instances just like if client had got the work the 1st. time it asked. Occasionally it takes multiple tries before the client gets the new work, and if it takes too long (multiple days), it's possible the work won't be re-issued any longer, for example because the WU has already been validated.
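As a rough illustration of that detection step, here is a toy Python model. It is not BOINC's scheduler code: the function name `find_lost_results` and the WU names are hypothetical, and the sketch assumes the server simply compares the results it has on record for a host with the list the client reports in its next request.

```python
def find_lost_results(server_assigned, client_reported):
    """Results the server has on record for this host that the client does not list."""
    return sorted(set(server_assigned) - set(client_reported))


# Hypothetical state: the server assigned two results, but the reply that
# carried the second one never reached the client.
server_assigned = {"WU_A_0", "WU_B_1"}
client_reported = {"WU_A_0"}

resend = find_lost_results(server_assigned, client_reported)
for name in resend:
    print(f"Resent lost result {name}")
```

Under these assumptions the client ends up holding both results, and from then on nothing distinguishes the resent one from work that arrived on the first attempt.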

----------------------------------------

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

[Feb 6, 2012 10:40:39 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Lost result.

Hmm... the explanation in the Ingleside [Feb 6, 2012 10:40:39 PM] post does seem to fit nicely with the flow of the reported sequence of events as narrated by my stdoutdae.txt file.

Going by that explanation, I'd say that the server apparently did not detect, but should have detected, that my machine never got the 'lost' WUs to begin with. If the WU detection had worked correctly, it would not be logical for the server to assume that the client (my machine) was 'missing' some tasks. More to the point: the server seems to have incorrectly assumed that my machine had just successfully downloaded the new WUs (where in reality my machine didn't have those WUs), and the WCG server then labeled the WUs as 'lost' and 'resent the lost result'. However, the intervening communication failures may have caused the loss of the synchronization necessary for a correct assessment of the client's situation.

Everything falls into place now, and anyway, the bottom line ("just like if client had got the work the 1st. time") is reassuring. :)

;

[Feb 6, 2012 11:52:52 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Lost result.

There's the old 'should' word again. The client and server constantly interact and try to make sure that both hold the same records, with a "better safe than sorry" attitude. "Ready to Report" (RtR) followed by "result already reported as success" is the classic case of a failed confirming handshake: the client sends the RtR, the server receives it and confirms receipt, but the client fails to receive the acknowledgement, so on the next connect the client simply tries to send the RtR again.
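The failed handshake can be mimicked with a small Python simulation. This is a sketch under stated assumptions, not how the BOINC client or the WCG server is implemented: the `Server` and `Client` classes, the `reply_lost` switch and the WU name are all hypothetical, and only the refusal text is taken from the log quoted earlier in the thread.

```python
class Server:
    def __init__(self):
        self.reported = set()

    def report(self, result):
        """Accept a result once; refuse a duplicate report."""
        if result in self.reported:
            return f"Completed result {result} refused: result already reported as success"
        self.reported.add(result)
        return f"ack {result}"


class Client:
    def __init__(self):
        self.ready_to_report = set()

    def send_reports(self, server, reply_lost=False):
        """Report every pending result; clear it only if a reply arrives."""
        messages = []
        for result in sorted(self.ready_to_report):
            reply = server.report(result)
            if reply_lost:
                continue  # the reply never arrived, so the result stays pending
            messages.append(reply)
            self.ready_to_report.discard(result)
        return messages


server, client = Server(), Client()
client.ready_to_report.add("WU_X_0")
first = client.send_reports(server, reply_lost=True)  # server records it, ack is lost
second = client.send_reports(server)                  # client retries, server refuses
```

In this model the refusal on the second attempt is harmless: the server already holds the result, and the client can finally clear it from its pending list.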

There are many messages that are simply not printed by default, because some users might get confused by "too much information". But had you had the sched_op debug flag set, you would have noticed that a line like this one was missing:

721 World Community Grid 7-2-2012 1:02:29 [sched_op] handle_scheduler_reply(): got ack for task GFAM_x2aqk_TB_ENRmutS94A_0004928_0166_0

Reading the stdoutdae.txt file would then have made it obvious that this last confirmation had not happened, due to whatever glitch.
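Checking for that confirmation could be automated along these lines. The pattern is taken from the single [sched_op] line quoted above and may not match every client version; the helper name `unacknowledged` and the second task name are hypothetical.

```python
import re

# Pattern based on the one [sched_op] line quoted in this post.
ACK = re.compile(r"\[sched_op\] handle_scheduler_reply\(\): got ack for task (\S+)")


def unacknowledged(reported, log_lines):
    """Return the reported task names for which no ack line appears in the log."""
    acked = {m.group(1) for line in log_lines if (m := ACK.search(line))}
    return sorted(set(reported) - acked)


log = [
    "721 World Community Grid 7-2-2012 1:02:29 [sched_op] handle_scheduler_reply(): "
    "got ack for task GFAM_x2aqk_TB_ENRmutS94A_0004928_0166_0",
]
# "HYPOTHETICAL_TASK_0" stands in for a task whose ack never arrived.
pending = unacknowledged(
    ["GFAM_x2aqk_TB_ENRmutS94A_0004928_0166_0", "HYPOTHETICAL_TASK_0"], log
)
```

Any name left in `pending` is a task the client reported without seeing the server's acknowledgement in the log.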

As it stands, this exchange process now happens for nearly 900,000 results daily, today even 950,000, and yet not a single result gets lost... except maybe for the impatient micro-managers. They get served with the specials to keep them busy ;>)

--//--

[Feb 7, 2012 12:19:35 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Lost result.

This is the first time this 'lost result' thing has happened to any of my machines. In retrospect, I may not have given the sync process enough time to work itself out. I'll try to be more patient the next time a similar situation arises.

P.S:
'Lost' is not quite an accurate description of what happened to the result, and thus 'lost result' is not a correct expression. At the time of the query, I never had the new WUs to begin with, and therefore there was nothing to miss and nothing to lose. I suggest that 'lost result' be renamed to 'assigned WU' to reflect more accurately what happened. The word 'resend' is also not quite accurate. If any action needs to be done again, it is the action of trying (to download the WU to the client), and not the action of sending (the WU); therefore re-try rather than re-send. Sending a WU twice is an illegal operation; what we want to do instead is re-try the download of the WU which the client does not already have, but should have. Not to mention that there was no first send: nothing was sent, because the send attempt failed.

Thus, instead of...
Message from server: Resent lost result x

...I recommend the following:
Message from server: Assigned CPU WU x queued for download

where: x is the WU-name.

Notes:
1] The 'CPU' qualifier is needed heading forward to a CPU+GPU world to differentiate CPU-WUs from GPU-WUs.
2] 'WU' is a precise reference to the work about to be downloaded; 'result' is too broad a substitute as to be ambiguous.
3] 'queued' is excellent not only to express the concept of FIFO, line, and sequence, but also to set the stage for a download/upload about to take place.
;
--------------
edit1:2012.02.08We.1305.utc:
1] Added my recommendations in a P.S. paragraph
2] spell/grammar check.

----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 8, 2012 1:06:12 PM]

[Feb 7, 2012 3:04:36 PM]
