World Community Grid Forums
Category: Completed Research Forum: Human Proteome Folding - Phase 2 Thread: Problem with program? |
Thread Status: Active Total posts in this thread: 46
teletran
Senior Cruncher Joined: Jul 27, 2005 Post Count: 378 Status: Offline |
I had a result stuck at 100% (on a PC located away from home) last night but aborted the unit before I thought to get the info. Not a post with helpful info, I know, but just wanted to show another instance of this problem.
||
|
davidhobbs
Senior Cruncher England Joined: Dec 30, 2004 Post Count: 151 Status: Offline Project Badges: |
Hello again,
The work unit that I re-started has now completed successfully and a new one has been downloaded and started. Sort of good news that it worked this time, but bad news that we didn't reproduce the original problem. Sorry! David. |
||
|
Alther
Former World Community Grid Tech United States of America Joined: Sep 30, 2004 Post Count: 414 Status: Offline Project Badges: |
Quote:
I now have a HPF2 UD work unit stuck at 100%, so I hope the following info might help Rick.
Device ID 179553.
Task run time almost 33 hours
No result returned during this time
No graphics viewable (get the "wait a few seconds then return" message)
The PC will have been re-started at least four times, but using hibernation so the work unit itself should not have been interrupted.
The process UD_9930506.exe is using 98% CPU time. I can see UD.exe also listed in task manager, but I can't see any other WCG processes. Shouldn't there be three of them? Aha! if I look at one of my other machines running the same project I see that ud_9930506 is listed but not consuming any significant processor time, and wcg_hpf2_rosetta.exe is the one using 98% CPU time. Perhaps this will give you a clue?
I shut down the agent (by choosing EXIT from the sys tray icon) and then re-started it. The agent started again from 0% without contacting the grid server. Shall I kill this process and get a new work unit or do you want to see what happens when this one reaches 100% again?
David.

David, this is great info. This narrows it down quite a bit. No graphics are visible because the science app actually finished (thus there's nothing to display). So, what appears to have happened is that the science app ended and the UD process that's responsible for packaging the results up has gone into a loop for some reason. This might be because there is a bug in the science app which hasn't released a resource properly or a bug in UD (though if that's the case, I don't know why we haven't seen it before). Maybe it's a race condition between the two.
In any case, it appears the best solution when you notice the 100% "stuck" condition is to simply exit UD and restart. Yes, it will restart the WU, but it will likely complete this time around. Thanks again for the detailed info.
Rick Alther
Former World Community Grid Developer |
||
|
davidhobbs
Senior Cruncher England Joined: Dec 30, 2004 Post Count: 151 Status: Offline Project Badges: |
Hi Rick,
I'm pleased to think I may have helped you on your way to understanding the problem. I'm intrigued, however, that you don't seem surprised that it finishes successfully when re-started. My simple mind expects the application to reproduce the symptoms exactly each time if it is working with exactly the same data. (Oh, unless it is a race condition of course, as you suggest). David. [Edit 1 times, last edit by davidhobbs at Nov 27, 2006 8:36:11 PM] |
||
|
Alther
Former World Community Grid Tech United States of America Joined: Sep 30, 2004 Post Count: 414 Status: Offline Project Badges: |
Ha! I've seen enough wacky and buggy code in my time that not much surprises me anymore
But seriously, knowing the code, I listed those questions earlier because I knew what information I needed to narrow down the problem. Your answers really did narrow it down to only a couple of possibilities. It's almost certainly a race condition. Unfortunately, this problem coupled with the way the application works makes it very difficult to reproduce. But also because of that very nature, it means it will likely complete a second time around. I've made a packaging change to the next release to turn off a UD feature I turned on for this project. Hopefully this feature was the cause of the problem and will now disappear. This next version, 5.1.2.0, will be released very soon, likely this week sometime.
Rick Alther
Former World Community Grid Developer [Edit 1 times, last edit by Alther at Nov 28, 2006 6:55:58 PM] |
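For readers unfamiliar with the term, the kind of timing-dependent hang suspected above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `racy_wait` and `latched_wait` are hypothetical, the "science app" and "packager" are stand-in threads, and nothing here reflects the actual UD or science application code. It shows a "lost wakeup": if the finishing side signals before the waiting side is listening, the signal is missed, whereas a latched flag is seen regardless of ordering. The same input can therefore hang on one run and complete on the next.

```python
import threading


def racy_wait(notify_first: bool, timeout: float = 0.5) -> bool:
    """Hypothetical packager waiting on a bare Condition.

    Returns True if the 'finished' signal was seen, False if it was missed
    (a real waiter without a timeout would be stuck at this point).
    """
    cond = threading.Condition()

    def science_app() -> None:
        with cond:
            cond.notify()  # a no-op if nobody is waiting yet

    worker = threading.Thread(target=science_app)
    if notify_first:
        # The science app finishes before the packager starts waiting.
        worker.start()
        worker.join()
        with cond:
            return cond.wait(timeout)
    # The packager is already holding the condition when the app finishes.
    with cond:
        worker.start()
        seen = cond.wait(timeout)
    worker.join()
    return seen


def latched_wait(notify_first: bool, timeout: float = 0.5) -> bool:
    """Same scenario using an Event, which stays set once signalled."""
    done = threading.Event()
    worker = threading.Thread(target=done.set)
    worker.start()
    if notify_first:
        worker.join()
    seen = done.wait(timeout)
    worker.join()
    return seen
```

With these definitions, `racy_wait(notify_first=True)` misses the signal while `racy_wait(notify_first=False)` sees it, and `latched_wait` sees it in both orderings.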
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Alther,
Just my 2 cents on this 'race condition' you mentioned before .... could timings/sequences of UD/BOINC events be impacted by how much it gets held up by security software? Whenever, wherever, I continue to add exceptions for the UD & BOINC processes in Firewall and Antivir (for localhost traffic exclusively). GC was logging huge amounts of disk hits, so I told the antivir to ignore it. Every time there is a new version number in the process name, the merry-go-round starts again. cheers
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All! |
||
|
davidhobbs
Senior Cruncher England Joined: Dec 30, 2004 Post Count: 151 Status: Offline Project Badges: |
Thanks for your response, Rick.
Good luck with the update - we'll give it a good thrashing in due course! David. |
||
|
pcwr
Ace Cruncher England Joined: Sep 17, 2005 Post Count: 10903 Status: Offline Project Badges: |
Quote:
Ha! I've seen enough wacky and buggy code in my time that not much surprises me anymore
But seriously, knowing the code, I listed those questions earlier because I knew what information I needed to narrow down the problem. Your answers really did narrow it down to only a couple of possibilities. It's almost certainly a race condition. Unfortunately, this problem coupled with the way the application works makes it very difficult to reproduce. But also because of that very nature, it means it will likely complete a second time around. I've made a packaging change to the next release to turn off a UD feature I turned on for this project. Hopefully this feature was the cause of the problem and will now disappear. This next version, 5.1.2.0, will be released very soon, likely this week sometime.

Mine gets to 100%, then crashes and starts again. It gets to 100%, then crashes again. Is there any way of telling it to get new data from another project so it can progress? regards, Patrick UK |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
For a fresh one, go to Task Manager and kill the wcg_hpf2_rosetta.exe process. It will send the bad result back before fetching a new one. Task Manager can be opened either by pressing Ctrl-Alt-Del or by right-clicking the taskbar at the bottom of the screen.
[Edit 1 times, last edit by Sekerob at Nov 29, 2006 10:34:13 PM] |
||
|
|