World Community Grid Forums
Thread Status: Active Total posts in this thread: 129
Viktors
Former World Community Grid Tech Joined: Sep 20, 2004 Post Count: 653 Status: Offline |
There have been various posts in the forums about long-running, seemingly stuck HPF2 work units, work units that quit early, and ones for which different agents get divergent answers. Most work units seem to be processing normally and completing properly, but we know that a few behave in unusual ways, and there are different causes for this. For ones that seem stuck for a long time, the Rosetta program is probably still trying to determine whether they are converging. Ones that quit early are probably hitting a subtle bug in Rosetta.
To figure out how best to handle and fix these work units, we need to identify them so that we can do further testing and debugging on them. Instead of terminating problem work units, it would be useful to the tech team if members identified the particular agent running the work unit (for example, using the UD device ID number shown in the preferences window of the agent, under the checkmark icon) and the UTC time and date at which it was running. We have asked the community advisors to help us collect information about these work units so we can use it in our investigations.
We are unable to find all such unusual work units in our testing prior to launch because they are relatively rare. On the production grid, we process a tremendous amount of work each day, and thus very subtle problems reveal themselves. Members who call attention to specific unusual work units will be doing us a great favor. Our behind-the-scenes testing of problem work units is very time consuming, so if members simply let these unusual work units finish, we will be able to tell more about what was going on instead of losing that information.
We will probably be making some changes in Rosetta to speed up the detection of non-convergent work units, and to make the progress bar show finer progress increments or use some other means to show whether the work unit is "stuck". Finally, there seems to be a subtle bug that aborts a few work units. Some of these work units have to run a long time to reach the point where the problem occurs, and shortcuts seem to hide the bug in some cases, so testing and debugging them requires a lot of time. Please be patient with us as we take care of these problems. Our team is also extra busy, with its time divided among project work, getting an additional research project ready for launch very soon. So, thank you for your patience and assistance. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Since I downloaded the updated HPF2 program on 06/28/2006, I have had 1 error and 4 invalids. Here they are, together with my Computer ID, etc, from the Messages Tab in BOINC Manager.
Starting BOINC client version 5.4.9 for windows_intelx86
Processor: 1 AuthenticAMD AMD Sempron(tm) Processor 3100+
Memory: 895.48 MB physical, 2.12 GB virtual
Disk: 40.33 GB total, 29.13 GB free
Computer ID: 32229
Status --- Work Unit ---- Time Sent ----------- Time Returned ------- CPU Time
Error ---- za095_ 00852 - 07/05/2006 21:18:42 - 07/06/2006 09:23:15 - 3.68
Invalid -- za053_ 00268 - 07/03/2006 08:35:42 - 07/03/2006 17:38:16 - 4.58
Invalid -- za083_ 00265 - 07/03/2006 01:38:35 - 07/03/2006 12:49:27 - 4.48
Invalid -- za082_ 00363 - 07/02/2006 10:18:15 - 07/02/2006 19:25:46 - 4.73
Invalid -- za067_ 00001 - 07/01/2006 03:25:32 - 07/01/2006 14:56:31 - 5.06 |
||
|
olympic
Senior Cruncher Joined: Jun 12, 2005 Post Count: 156 Status: Offline |
Here are my invalid results since Rosetta 5.07 was released. I'm running BOINC 5.4.9. Both machines are dual core AMD Opterons with plenty of RAM and disk space.
||
|
debrouxl
Advanced Cruncher France Joined: Dec 31, 2004 Post Count: 61 Status: Offline |
I still get abnormal WU terminations and "corrupted double-linked list" messages in the terminal (but less frequently than before). At least two faults occurred right after the WU started, after 3 or 4 seconds of CPU time, if that helps. za081_00256_2 is one of them (I had to cancel it); I forgot to keep track of the number of the other one (but it's probably za078_0064, which terminated abnormally yesterday with signal SIGSEGV).
This happens on a P4 2.6 GHz with 512 MB of RAM, running BOINC 5.4.9 under MEPIS/Debian GNU/Linux. The older computer I put back on the grid in the past few days, an Athlon 1 GHz with 128 MB of RAM, crunches FAAH WUs correctly, although the BOINC agent complains that the WU needs ~50% more memory than I have. I have seen that HPF2 requires more power or RAM than FAAH, so I guess that's why it has not received HPF2 WUs yet.
[Edit 1 times, last edit by DEBROUX Lionel at Jul 7, 2006 6:42:44 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have an HPF2 work unit at 0% after 31 hours running on an Opteron @ 2.6 GHz with 2 GB of DDR.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi Mark099,
Could you give us your Computer ID and the work unit name? Also, which client are you running, what CPU do you have, and how much RAM and virtual memory?
Lawrence |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi Mark099.
Please will you give us a little more information? What is your Device ID, and what time did you download the work unit (UTC, or local time + timezone)? After over 24 hours with no progress, you should feel free to abort the work unit. Thank you. |
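For anyone unsure how to turn a local download time into the UTC form requested above, here is a minimal Python sketch. The helper name `to_utc_report`, the device ID "12345", and the UTC-5 offset are all made up for illustration; only the MM/DD/YYYY time format is taken from the BOINC listings in this thread.

```python
from datetime import datetime, timedelta, timezone


def to_utc_report(device_id: str, local_time: str, utc_offset_hours: float) -> str:
    """Build a one-line report with the local time converted to UTC.

    local_time uses the MM/DD/YYYY HH:MM:SS layout seen in the result lists.
    utc_offset_hours is the poster's offset from UTC (e.g. -5 for UTC-5).
    """
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local_dt = datetime.strptime(local_time, "%m/%d/%Y %H:%M:%S")
    # Attach the poster's fixed offset, then convert to UTC.
    utc_dt = local_dt.replace(tzinfo=local_zone).astimezone(timezone.utc)
    return f"Device {device_id}: downloaded {utc_dt:%Y-%m-%d %H:%M:%S} UTC"


# Hypothetical device ID and offset, for illustration only.
report = to_utc_report("12345", "07/06/2006 09:23:15", -5)
print(report)
```

A fixed offset is used here instead of a named time zone so the sketch does not depend on a time zone database being installed; it does not account for daylight saving changes.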
||
|
Dirk Gently
Senior Cruncher England Joined: Mar 1, 2005 Post Count: 153 Status: Offline |
I have a WU currently at 62.5%. I don't know exactly how long it has been stuck, but processing time now totals 21:20, which is excessive for this machine.
I will let it continue, but it is probably wasting a lot of processing time. I think that a self-abort feature is important, and maybe also an indicator of how long the WU has been stuck. |
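The "how long has it been stuck" indicator suggested above could be sketched roughly as follows. This is only an illustration in Python, not how the UD agent, BOINC, or Rosetta work internally; the `StallWatch` class and the 24-hour threshold are assumptions (the threshold echoes the "over 24 hours with no progress" advice earlier in the thread).

```python
from dataclasses import dataclass


@dataclass
class StallWatch:
    """Tracks how long the reported progress value has stayed unchanged."""

    abort_after_s: float          # hypothetical stall limit, in seconds
    last_progress: float = -1.0   # last progress percentage seen
    last_change_s: float = 0.0    # timestamp of the last progress change

    def update(self, progress: float, now_s: float) -> float:
        """Record a progress sample; return seconds since progress last moved."""
        if progress != self.last_progress:
            self.last_progress = progress
            self.last_change_s = now_s
        return now_s - self.last_change_s

    def should_abort(self, now_s: float) -> bool:
        """True once progress has been flat for at least abort_after_s."""
        return now_s - self.last_change_s >= self.abort_after_s


watch = StallWatch(abort_after_s=24 * 3600)
watch.update(62.5, now_s=0)                     # progress first seen at 62.5%
stalled_for = watch.update(62.5, now_s=3600)    # still 62.5% one hour later
print(f"stuck for {stalled_for / 3600:.1f} h, abort: {watch.should_abort(3600)}")
```

Note that a flat progress value alone does not prove a work unit is stuck, since some units reportedly sit for hours and then finish, so any real threshold would need tuning by the techs.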
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I'm seeing quite a few problems. BOINC User ID 225561, Host ID 41341, AMD X2 4800+, 2GB RAM, XP with SP2.
Over the past 48 hours, I've turned in 30 results, out of which 8 are Valid, 3 are Invalid, 10 are Pending Validation, 7 are Inconclusive, and 2 are Errors. I also have 4 other Errors from previous days in my list.
Error 1: za078_ 00639, returned at 07/01 20:49:28, Exception code: 0xc0000005, Exception address: 0x00488EB6, 5 other copies sent all also had error.
Error 2: za086_ 00440, returned at 07/03 16:58:13, Exception code: 0xc0000005, Exception address: 0x00488EB6, 5 other copies all errors.
Error 3: za087_ 00098, returned at 07/04 12:52:57, Exception code: 0xc0000005, Exception address: 0x00488EB6, 5 other copies all errors.
Error 4: za092_ 00346, returned at 07/05 05:16:52, Exception code: 0xc0000005, Exception address: 0x00A880DD, 4 other copies errors, 1 still in progress.
Error 5: za095_ 00666, returned at 07/06 02:54:40, aborted by user after being stuck without progress for 8 hours. This is a very fast machine; HPF2 units have normally been finishing in 2-4 hours, and the longest I've seen is 5 hours. 3 other copies all still in progress, probably stuck.
Error 6: za074_ 00005, returned at 07/06 20:49:45, had to reboot system after lockup (don't know if caused by BOINC or not, no apparent cause); when restarted, unit immediately aborted with: "Incorrect function. (0x1) - exit code 1 (0x1) fasta file not found! ERROR:: Unable to obtain sequence information. fasta file must be provided."
Out of the Inconclusives, 2 are notable:
1: za058_ 00167, returned 07/06 07:10:56, 9 other Inconclusive returns, 1 in progress.
2: za059_ 00356, returned 07/04 09:58:28, 11 other Inconclusives, 1 in progress.
None of my finalized Valid or Invalid results took more than 5 or 6 results to reach a quorum. I suspect that if you haven't gotten 3 matches after 10 or 12 results, there's something fundamentally wrong.
Hope this helps. |
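As a quick way to summarize a report like the one above, the crash entries can be grouped by exception address. The sketch below is plain Python with the values copied from the post; the variable names are made up for illustration. Exception code 0xc0000005 is the Windows access violation status.

```python
from collections import Counter

# (work unit, exception code, exception address), copied from Errors 1-4 above.
crashes = [
    ("za078_ 00639", "0xc0000005", "0x00488EB6"),
    ("za086_ 00440", "0xc0000005", "0x00488EB6"),
    ("za087_ 00098", "0xc0000005", "0x00488EB6"),
    ("za092_ 00346", "0xc0000005", "0x00A880DD"),
]

# Count how often each faulting address shows up.
by_address = Counter(address for _, _, address in crashes)
for address, count in by_address.most_common():
    print(f"{address}: {count} crash(es)")

# Cross-check the 48-hour tally quoted in the post.
status_counts = {
    "Valid": 8,
    "Invalid": 3,
    "Pending Validation": 10,
    "Inconclusive": 7,
    "Error": 2,
}
total_results = sum(status_counts.values())
print(f"total results: {total_results}")
```

Three of the four access violations share one address, which may point to a single code path, although that is only a guess from the addresses and not something the post confirms.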
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Dirk, there is a self-aborting, or rather a 'skip forward', feature, but in this new science project not all loops have been identified. For example, if an attempt is capped at 15 minutes but there are many more attempts than anticipated in a segment of the WU, it's going to take much longer. You don't want to skip to the next segment prematurely, so finding the sensible cutoff is a matter of trial and error.
An attempt at representing this in numbers: one HPF2 WU is 25 segments, so 4% is the step by which you will see progress on the front GUI page. If progress on the 'i' page is measured in 0.1 percent steps, you'd see only 40 progress steps in any segment **. If a segment is then made up of millions *** of possibilities, you'd see this percentage move forward only rarely.
The key in the UD agent is to look for some progress on the line graphs in the 'i' screen and on the tenth-of-a-percent progress bar there, rather than the full-percent progress on the front screen (thus 4% steps in the HPF2 example). If you don't see any for a long, long time, say half a day, you can abort by going to Task Manager and killing the UD_7xxxxxxx.exe process. It will send the file back and fetch a new one, and if there is a new version of the science.exe, it will be sent with it. I believe I have seen that credit is still given, but the file has to be sent back by the agent. Any other way of killing it might not do that; the agent might simply fetch a new one, with WCG recording the cancelled WU as a "no reply". That's waste.
Reports have been requested: post the device number and the (approximate) UTC time, or local time plus zone, at which the WU was downloaded and aborted. Then the WCG techs can do a detailed analysis of the aborted WU to further tweak the non-convergence loop exits.
In BOINC, it's much easier to see. The work tab shows CPU time, not wall-clock time. If that is not progressing for a while, the WU is potentially stuck, but even then I've had ZA's that sat with no CPU time progress for a long while, several hours, with NO OTHER PROCESSES HOGGING THE CPU, and then just finished. So far my log shows 22 HPF2 WUs calculated through: 2 invalid, 20 valid. I've cancelled 6 or 7 that were in the queue with v5.06 in BOINC, to force receipt of new HPF2 work with version 5.07 (*). I think the equivalent in the UD agent is something like 5.05.02 versus 5.05.03. You wonder about logs... yes, another very nice feature of BOINC.
* Long story, but any cancelling of 5.06 units before they started may have caused excess copies to be crunched in the 'inconclusive' saga. (See Knreed's explanations of why up to 12 copies could get sent out for a single WU.)
** I don't think the percent bar in the 'i' screen skips back to zero for each segment; I have never observed that. If it does, then obviously there are 1000 x 0.1% progress steps visible in a segment. As a suggestion, since I think WCG knows up front how many segments there are in an HPF1/2 WU, they could put the number on the 'i' page and show e.g. 'segment 12 of 25'. Together with a 1000 x 0.1% per-segment indication on the 'i' progress bar, that gives much greater granularity.
*** One explanation of why it takes much longer was described elsewhere as 'hi-resolution': versus HPF1, the nitty-gritty detail looked for goes much deeper than ever before, down to the atomic level.
Just my 2 pennies on how I translate it for myself. Anyone is free to skip, amplify or correct. Not interested in comments that start off with 'bad advice...'
time for coffee
ciao
WCG
Please help to make the Forums an enjoyable experience for All! |
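The progress-granularity arithmetic in the post above (25 segments, 4% front-page steps, 0.1% detail steps) can be checked with a few lines of Python. The function name `progress_steps` is made up for illustration, and the figures are the ones quoted in the post, not values confirmed by WCG.

```python
def progress_steps(segments: int, detail_step_pct: float) -> dict:
    """Work out how many visible progress steps fall inside one segment."""
    front_step_pct = 100 / segments  # share of the whole WU per segment
    return {
        "front_step_pct": front_step_pct,
        # Detail bar spans the whole WU: each segment gets only a slice of it.
        "steps_per_segment": round(front_step_pct / detail_step_pct),
        # Detail bar resets per segment: each segment gets the full bar.
        "steps_if_bar_resets": round(100 / detail_step_pct),
    }


hpf2 = progress_steps(segments=25, detail_step_pct=0.1)
print(hpf2)
```

With the quoted figures, one segment spans only 40 visible steps when the 0.1% bar covers the whole work unit, versus 1000 if the bar restarted for every segment, which is the finer granularity suggested in the footnote.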
||