Thread Status: Active
Total posts in this thread: 46
This topic has been viewed 3647 times and has 45 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem with program?

:P
[Nov 26, 2006 2:35:06 PM]
teletran
Senior Cruncher
Joined: Jul 27, 2005
Post Count: 378
Status: Offline
Re: Problem with program?

I had a result stuck at 100% (on a PC located away from home) last night but aborted the unit before I thought to get the info. Not a post with helpful info, I know, but just wanted to show another instance of this problem.
----------------------------------------
[Nov 26, 2006 4:00:31 PM]
davidhobbs
Senior Cruncher
England
Joined: Dec 30, 2004
Post Count: 151
Status: Offline
Good news... and bad

Hello again,

The work unit that I re-started has now completed successfully and a new one has been downloaded and started.

Sort of good news that it worked this time, but bad news that we didn't reproduce the original problem.

Sorry!

David.
[Nov 26, 2006 6:58:12 PM]
Alther
Former World Community Grid Tech
United States of America
Joined: Sep 30, 2004
Post Count: 414
Status: Offline
Re: Problem with program?

I now have a HPF2 UD work unit stuck at 100%, so I hope the following info might help Rick.

Device ID 179553.
Task run time almost 33 hours
No result returned during this time
No graphics viewable (get the "wait a few seconds then return" message)
The PC will have been re-started at least four times, but using hibernation so the work unit itself should not have been interrupted.
The process UD_9930506.exe is using 98% CPU time.
I can see UD.exe also listed in Task Manager, but I can't see any other WCG processes. Shouldn't there be three of them? Aha! If I look at one of my other machines running the same project, I see that ud_9930506 is listed but not consuming any significant processor time, and wcg_hpf2_rosetta.exe is the one using 98% CPU time. Perhaps this will give you a clue?
I shut down the agent (by choosing EXIT from the sys tray icon) and then re-started it. The agent started again from 0% without contacting the grid server.
Shall I kill this process and get a new work unit or do you want to see what happens when this one reaches 100% again?

David.


David, this is great info. This narrows it down quite a bit.

No graphics are visible because the science app actually finished (thus there's nothing to display).

So, what appears to have happened is that the science app ended and the UD process that's responsible for packaging the results up has gone into a loop for some reason. This might be because there is a bug in the science app which hasn't released a resource properly or a bug in UD (though if that's the case, I don't know why we haven't seen it before). Maybe it's a race condition between the two.
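
For anyone who wants to check their own machine against that description, here is a rough sketch of the same reasoning in Python. It is not WCG or UD code. The `ud_` prefix, the 90% threshold, and the function name `classify` are assumptions made up for illustration; only the process names come from David's report.

```python
from typing import Dict

# Hypothetical names and threshold, chosen only to mirror the report above.
SCIENCE_APP = "wcg_hpf2_rosetta.exe"
UD_WORKER_PREFIX = "ud_"
BUSY_THRESHOLD = 90.0  # percent CPU treated as "busy" (an assumption)


def classify(snapshot: Dict[str, float]) -> str:
    """Classify a {process name: CPU %} snapshot read off Task Manager."""
    procs = {name.lower(): cpu for name, cpu in snapshot.items()}
    science_busy = procs.get(SCIENCE_APP, 0.0) >= BUSY_THRESHOLD
    ud_busy = any(
        name.startswith(UD_WORKER_PREFIX) and cpu >= BUSY_THRESHOLD
        for name, cpu in procs.items()
    )
    if science_busy:
        return "normal: science app is computing"
    if ud_busy and SCIENCE_APP not in procs:
        return "suspect: science app gone, UD worker spinning"
    return "unknown"
```

A snapshot like David's stuck machine (UD_9930506.exe at 98% and no wcg_hpf2_rosetta.exe) falls into the "suspect" case, while his other machine (wcg_hpf2_rosetta.exe at 98%) falls into the "normal" case.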

In any case, it appears the best solution when you notice the 100% "stuck" condition is to simply exit UD and restart. Yes, it will restart the WU, but it will likely complete this time around.

Thanks again for the detailed info.
----------------------------------------
Rick Alther
Former World Community Grid Developer
[Nov 27, 2006 4:23:06 PM]
davidhobbs
Senior Cruncher
England
Joined: Dec 30, 2004
Post Count: 151
Status: Offline
Re: Problem with program?

Hi Rick,

I'm pleased to think I may have helped you on your way to understanding the problem. I'm intrigued, however, that you don't seem surprised that it finishes successfully when re-started. My simple mind expects the application to reproduce the symptoms exactly each time if it is working with exactly the same data. (Oh, unless it is a race condition of course, as you suggest).

David.
----------------------------------------
[Edit 1 times, last edit by davidhobbs at Nov 27, 2006 8:36:11 PM]
[Nov 27, 2006 8:34:15 PM]
Alther
Former World Community Grid Tech
United States of America
Joined: Sep 30, 2004
Post Count: 414
Status: Offline
Re: Problem with program?

Ha! I've seen enough wacky and buggy code in my time that not much surprises me anymore :)

But seriously, knowing the code, I listed those questions earlier because I knew what information I needed to narrow down the problem. Your answers really did narrow it down to only a couple of possibilities. It's almost certainly a race condition. Unfortunately, this problem, coupled with the way the application works, makes it very difficult to reproduce. But that same timing-dependent nature also means the work unit will likely complete the second time around.
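
To illustrate why a race condition can hang one run and sail through the next with exactly the same data, here is a toy model in Python. It is only a sketch of the general idea (a "lost signal" race), not a description of what UD or the science app actually does; the event names and the `hangs` and `interleavings` helpers are all hypothetical.

```python
from itertools import permutations
from typing import List, Tuple

# Toy events; none of these correspond to real UD or science-app internals.
SIGNAL = "app:signal_done"      # science app announces it has finished, once
LISTEN = "pkg:start_listening"  # packager begins listening for the signal
WAIT = "pkg:wait_for_signal"    # packager blocks until it has seen the signal


def hangs(order: Tuple[str, ...]) -> bool:
    """Return True if this ordering leaves the packager waiting forever."""
    listening = False
    seen = False
    for event in order:
        if event == LISTEN:
            listening = True
        elif event == SIGNAL:
            # A signal sent before anyone listens is lost in this toy model.
            seen = seen or listening
    return not seen


def interleavings() -> List[Tuple[str, ...]]:
    """All orderings that keep the packager's own steps in program order."""
    return [
        p for p in permutations((SIGNAL, LISTEN, WAIT))
        if p.index(LISTEN) < p.index(WAIT)
    ]


if __name__ == "__main__":
    for order in interleavings():
        print("HANG" if hangs(order) else "ok  ", " -> ".join(order))
```

In this model only the ordering where the signal arrives before the packager starts listening hangs. Which ordering happens on a given run depends on scheduling, not on the work unit's data, which is why a restart can succeed.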

I've made a packaging change to the next release to turn off a UD feature I turned on for this project. Hopefully this feature was the cause, and the problem will now disappear. This next version, 5.1.2.0, will be released very soon, likely sometime this week.
----------------------------------------
[Edit 1 times, last edit by Alther at Nov 28, 2006 6:55:58 PM]
[Nov 28, 2006 12:36:34 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Problem with program?

Alther,

Just my 2 cents on this 'race condition' you mentioned before... could the timing or sequence of UD/BOINC events be affected by how long they get held up by security software? On every machine I set up, I keep having to add exceptions for the UD & BOINC processes in Firewall and Antivir (for localhost traffic exclusively). GC was logging huge amounts of disk hits, so I told the antivir to ignore it. Every time there is a new version number in the process name, the merry-go-round starts again.
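
As a sketch of the version-number problem: if exceptions could be written as wildcard patterns rather than exact names, a new number in the process name would not need a new rule. The Python below only shows the matching idea. Whether any particular firewall or antivirus product accepts patterns like these is an assumption, and the pattern list and `is_excluded` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns; whether a given firewall or antivirus
# product accepts wildcards like these is an assumption, not a known feature.
EXCLUSION_PATTERNS = ("ud.exe", "ud_*.exe", "wcg_*.exe")


def is_excluded(process_name: str, patterns=EXCLUSION_PATTERNS) -> bool:
    """Return True if the process name matches any exclusion pattern."""
    # Lowercase first so matching ignores case, as Windows file names do.
    name = process_name.lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

With these patterns, UD_9930506.exe and any later ud_<number>.exe would both match the same rule.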

cheers
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Nov 28, 2006 12:57:23 PM]
davidhobbs
Senior Cruncher
England
Joined: Dec 30, 2004
Post Count: 151
Status: Offline
Re: Problem with program?

Thanks for your response, Rick.

Good luck with the update - we'll give it a good thrashing in due course!

David.
[Nov 28, 2006 6:17:31 PM]
pcwr
Ace Cruncher
England
Joined: Sep 17, 2005
Post Count: 10903
Status: Offline
Re: Problem with program?

Mine gets to 100%, then crashes and starts again. Gets to 100% then crashes again.

Is there any way of telling it to get new data from another project so it can make progress?

regards,
Patrick
UK
----------------------------------------

[Nov 29, 2006 10:29:12 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Problem with program?

For a fresh one, go to Task Manager and kill the wcg_hpf2_rosetta.exe process. It will send the bad result back first. Task Manager can be opened either by pressing Ctrl-Alt-Del simultaneously or by right-clicking the taskbar at the bottom of the screen.
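
If you prefer a script to clicking through Task Manager, the same step can be sketched with the Windows `taskkill` command (`/IM` selects by image name, `/F` forces termination). This is a minimal sketch, assuming a Windows machine; the helper names `build_kill_command` and `kill_process` are made up here, and force-killing has the same effect as ending the process by hand.

```python
import subprocess
import sys
from typing import List


def build_kill_command(image_name: str) -> List[str]:
    """Build a Windows taskkill command that force-ends a process by image name."""
    if not image_name.lower().endswith(".exe"):
        raise ValueError("expected an executable image name such as foo.exe")
    return ["taskkill", "/IM", image_name, "/F"]


def kill_process(image_name: str) -> int:
    """Run taskkill and return its exit code (Windows only)."""
    if sys.platform != "win32":
        raise OSError("taskkill is only available on Windows")
    return subprocess.run(build_kill_command(image_name), check=False).returncode
```

For the case in this thread the call would be `kill_process("wcg_hpf2_rosetta.exe")`.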
----------------------------------------
[Edit 1 times, last edit by Sekerob at Nov 29, 2006 10:34:13 PM]
[Nov 29, 2006 10:32:21 PM]