Thread Status: Active
Total posts in this thread: 129
This topic has been viewed 8413 times and has 128 replies
Viktors
Former World Community Grid Tech
Joined: Sep 20, 2004
Post Count: 653
Status: Offline
A few unusual HPF2 work units

There have been various posts in the forums about long-running, seemingly stuck HPF2 work units, work units that quit early, and ones for which different agents get divergent answers. Most work units are processing normally and completing properly, but we know that a few behave in unusual ways, and for different reasons. For ones that seem stuck for a long time, the Rosetta program is probably trying to determine whether they are non-converging. Ones that quit early are probably subject to a subtle bug in Rosetta.

To figure out how best to handle and fix these work units, we need to identify them so that we can do further testing and debugging on them. Instead of terminating problem work units, it would be useful to the tech team if members identified the particular agent running the work unit (for example, using the UD device ID number shown in the agent's preferences window (checkmark icon)) and the UTC time and date at which it was running. We have asked the community advisors to help us collect information about these work units so we can use it in our investigations.

We are unable to find all such unusual work units in our testing prior to launch because they are relatively rare. On the production grid we process a tremendous amount of work each day, and thus very subtle problems reveal themselves. Members who call attention to specific unusual work units will be doing us a great favor. Our behind-the-scenes testing of problem work units is very time-consuming, so if members simply let these unusual work units finish, we will be able to tell more about what was going on instead of losing that information.

We will probably be making some changes in Rosetta to speed up the detection of non-convergent work units, and to make the progress bar show finer progress increments or use some other means to show whether the work unit is "stuck". Finally, there seems to be a subtle bug that aborts a few work units. Some of these work units have to run a long time to reach the point where the problem occurs, and shortcuts seem to hide the bug in some cases, so testing and debugging them requires a lot of time. Please be patient with us as we take care of these problems. Our team is also extra busy, with its time divided between project work and getting an additional research project ready for launch very soon. So, thank you for your patience and assistance.
[Jul 7, 2006 2:18:52 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
My unusual HPF2 work units

Since I downloaded the updated HPF2 program on 06/28/2006, I have had 1 error and 4 invalids. Here they are, together with my Computer ID, etc, from the Messages Tab in BOINC Manager.

Starting BOINC client version 5.4.9 for windows_intelx86
Processor: 1 AuthenticAMD AMD Sempron(tm) Processor 3100+
Memory: 895.48 MB physical, 2.12 GB virtual
Disk: 40.33 GB total, 29.13 GB free
Computer ID: 32229;

Work Unit      Time Sent             Time Returned         CPU Time

Error
za095_ 00852   07/05/2006 21:18:42   07/06/2006 09:23:15   3.68

Invalid
za053_ 00268   07/03/2006 08:35:42   07/03/2006 17:38:16   4.58
za083_ 00265   07/03/2006 01:38:35   07/03/2006 12:49:27   4.48
za082_ 00363   07/02/2006 10:18:15   07/02/2006 19:25:46   4.73
za067_ 00001   07/01/2006 03:25:32   07/01/2006 14:56:31   5.06
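
For anyone collecting reports like the one above, here is a minimal sketch of how such a row could be parsed to compare wall-clock turnaround with the reported CPU time. The function name `parse_row` and the dictionary keys are hypothetical, and the unit of the CPU Time column is not stated in the post, so the value is kept as a plain number.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"

def parse_row(row: str) -> dict:
    """Parse one whitespace-separated row: name (two pieces), sent, returned, CPU time."""
    parts = row.split()
    # The work unit name is printed in two pieces ("za095_" and "00852").
    name = parts[0] + parts[1]
    sent = datetime.strptime(f"{parts[2]} {parts[3]}", FMT)
    returned = datetime.strptime(f"{parts[4]} {parts[5]}", FMT)
    return {
        "name": name,
        # Wall-clock time between "Time Sent" and "Time Returned", in hours.
        "turnaround_hours": (returned - sent).total_seconds() / 3600,
        # Unit not stated in the post; kept as reported.
        "cpu_time": float(parts[6]),
    }

row = parse_row("za095_ 00852 07/05/2006 21:18:42 07/06/2006 09:23:15 3.68")
print(row["name"], round(row["turnaround_hours"], 2), row["cpu_time"])
```

Turnaround includes time the work unit spent waiting in the queue, so it is only an upper bound on how long the unit actually ran.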
[Jul 7, 2006 3:59:56 AM]
olympic
Senior Cruncher
Joined: Jun 12, 2005
Post Count: 156
Status: Offline
Re: A few unusual HPF2 work units

Here are my invalid results since Rosetta 5.07 was released. I'm running BOINC 5.4.9. Both machines are dual core AMD Opterons with plenty of RAM and disk space.


Work Unit      Device ID         Sent Time             Return Time
za070_ 00051   olympic-2da65e3   07/03/2006 00:28:40   07/03/2006 09:00:36
za050_ 00421   olympic-opty165   07/02/2006 22:52:03   07/03/2006 10:51:28
za082_ 00714   olympic-2da65e3   07/02/2006 13:37:40   07/03/2006 01:29:01

----------------------------------------

[Jul 7, 2006 5:47:31 AM]
debrouxl
Advanced Cruncher
France
Joined: Dec 31, 2004
Post Count: 61
Status: Offline
Re: A few unusual HPF2 work units

I still get abnormal WU terminations and "corrupted double-linked list" messages in the terminal (but less frequently than before). At least two faults occurred right after the WU started, after 3 or 4 seconds of CPU time, if that helps. za081_00256_2 is one of them (I had to cancel it); I forgot to keep track of the number of the other one (but it's probably za078_0064, which terminated abnormally yesterday with signal SIGSEGV).
This happens on a P4 2.6 GHz with 512 MB of RAM, running BOINC 5.4.9 under MEPIS/Debian GNU/Linux.

The older computer I put back on the grid in the past few days, an Athlon 1 GHz with 128 MB of RAM, crunches FAAH WUs correctly, although the BOINC agent complains that the WU needs ~50% more memory than I have. I have seen that HPF2 requires more power or RAM than FAAH, so I guess that's why it has not received HPF2 WUs yet.
----------------------------------------
[Edit 1 times, last edit by DEBROUX Lionel at Jul 7, 2006 6:42:44 AM]
[Jul 7, 2006 6:41:12 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

I have an HPF2 work unit at 0% after 31 hours running on an Opteron @ 2.6 GHz with 2 GB of DDR.
[Jul 7, 2006 7:02:52 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

Hi Mark099,
Could you give us your Computer ID and the work unit name? Also, which client are you running, what CPU and how much RAM and Virtual Memory do you have?
Lawrence
[Jul 7, 2006 7:37:34 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

Hi Mark099.

Please will you give us a little more information?

What is your Device ID, and what time did you download the work unit (UTC, or local time + timezone)?
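
If you only have the local time, a minimal sketch of the conversion to UTC is shown here. The function name `to_utc` and the example offset are illustrative assumptions; you need to supply your own UTC offset, including any daylight saving adjustment in effect at the time.

```python
from datetime import datetime, timedelta, timezone

def to_utc(local_str: str, utc_offset_hours: float) -> str:
    """Convert a local 'MM/DD/YYYY HH:MM:SS' timestamp to a UTC string."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local_dt = datetime.strptime(local_str, "%m/%d/%Y %H:%M:%S").replace(tzinfo=local_tz)
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

# Hypothetical example: a download logged at 21:18:42 local time, 5 hours behind UTC.
print(to_utc("07/05/2006 21:18:42", -5))
```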

After over 24 hours with no progress, you should feel free to abort the work unit.

Thank you.
[Jul 7, 2006 7:41:42 AM]
Dirk Gently
Senior Cruncher
England
Joined: Mar 1, 2005
Post Count: 153
Status: Offline
Re: A few unusual HPF2 work units

I have a WU currently at 62.5%. I don't know exactly how long it has been stuck, but processing time now totals 21:20, which is excessive for this machine.

I will let it continue, but it is probably wasting a lot of processing time. I think that a self-abort feature is important, and maybe also an indicator of how long it has been stuck.
----------------------------------------
[Jul 7, 2006 8:39:00 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

I'm seeing quite a few problems. BOINC User ID 225561, Host ID 41341, AMD X2 4800+, 2GB RAM, XP with SP2.

Over the past 48 hours, I've turned in 30 results, out of which 8 are Valid, 3 are Invalid, 10 are Pending Validation, 7 are Inconclusive, and 2 are Errors. I also have 4 other Errors from previous days in my list.

Error 1: za078_ 00639, returned at 07/01 20:49:28, Exception code: 0xc0000005, Exception address: 0x00488EB6, 5 other copies sent all also had error.

Error 2: za086_ 00440, returned at 07/03 16:58:13, Exception code: 0xc0000005, Exception address: 0x00488EB6, 5 other copies all errors.

Error 3: za087_ 00098, returned at 07/04 12:52:57, Exception code: 0xc0000005, Exception address: 0x00488EB6, 5 other copies all errors.

Error 4: za092_ 00346, returned at 07/05 05:16:52, Exception code: 0xc0000005, Exception address: 0x00A880DD, 4 other copies errors, 1 still in progress.

Error 5: za095_ 00666, returned at 07/06 02:54:40, aborted by user after being stuck without progress for 8 hours. This is a very fast machine; HPF2 units have normally been finishing in 2-4 hours, and the longest I've seen is 5 hours. 3 other copies all still in progress, probably stuck.

Error 6: za074_ 00005, returned at 07/06 20:49:45, had to reboot system after lockup (don't know if caused by BOINC or not, no apparent cause), when restarted, unit immediately aborted with: "Incorrect function. (0x1) - exit code 1 (0x1) fasta file not found! ERROR:: Unable to obtain sequence information. fasta file must be provided."
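
Exception code 0xc0000005 is the Windows access-violation status code. As a rough sketch of how reports like Errors 1 to 4 could be triaged, the snippet below groups them by exception address, using only the values listed above. Crashes that share an address may share a cause, but that is an inference, not something the reports confirm.

```python
from collections import Counter

# (work unit, exception code, exception address) as reported in Errors 1-4 above.
errors = [
    ("za078_ 00639", 0xC0000005, 0x00488EB6),
    ("za086_ 00440", 0xC0000005, 0x00488EB6),
    ("za087_ 00098", 0xC0000005, 0x00488EB6),
    ("za092_ 00346", 0xC0000005, 0x00A880DD),
]

# Count how many crashes were reported at each exception address.
by_address = Counter(addr for _, _, addr in errors)
for addr, count in by_address.most_common():
    print(f"0x{addr:08X}: {count} crash(es)")
```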

Out of the Inconclusives, 2 are notable:

1: za058_ 00167, returned 07/06 07:10:56, 9 other Inconclusive returns, 1 in progress.

2: za059_ 00356, returned 07/04 09:58:28, 11 other Inconclusives, 1 in progress.

None of my finalized Valid or Invalid results took more than 5 or 6 results to reach a quorum. I suspect that if you haven't gotten 3 matches after 10 or 12 results, there's something fundamentally wrong.
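
The rule of thumb above could be sketched as a simple check. The function name `looks_suspicious` and the default threshold of 10 are assumptions taken from the "10 or 12" figure in this post, not from any WCG validator setting.

```python
def looks_suspicious(results_returned: int, quorum_reached: bool, max_results: int = 10) -> bool:
    """Flag a work unit that has many returned results but still no quorum."""
    return not quorum_reached and results_returned >= max_results

# Counts from the two Inconclusives above: the poster's own result plus the other returns.
print(looks_suspicious(1 + 9, quorum_reached=False))   # za058_ 00167
print(looks_suspicious(1 + 11, quorum_reached=False))  # za059_ 00356
```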

Hope this helps.
[Jul 7, 2006 8:41:21 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A few unusual HPF2 work units

Dirk, there is a self-aborting, or rather a 'skip forward', feature. But in this new science project not all loops have been identified yet: if an attempt is limited to 15 minutes but a segment of the WU needs many more attempts than anticipated, it's going to take much longer. You don't want to skip to the next segment prematurely, so it's trial and error to find where the sensible limit lies.

An attempt at representing this in numbers: 1 HPF2 WU is 25 segments, so progress on the front GUI page advances in steps of 4%. If progress on the 'i' page is measured in 0.1 percent steps, you'd see only 40 progress steps in any segment **. If a segment is then made up of millions *** of possibilities, you'd rarely see this percentage move forward.
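
The arithmetic in that paragraph, written out as a short check (the variable names are illustrative):

```python
segments = 25                      # segments in one HPF2 work unit, per the post
percent_per_segment = 100 / segments
step = 0.1                         # smallest step shown on the 'i' page, in percent

# round() guards against floating-point error in the division by 0.1.
steps_per_segment = round(percent_per_segment / step)

print(percent_per_segment, steps_per_segment)
```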

The key in the UD agent is to watch for some progress on the line graphs in the 'i' screen and on the tenth-of-a-percent progress bar there, rather than the full-percent progress on the front screen (which moves in steps of 4 for HPF2, as in the example). If you don't see any for a long, long time, say half a day, you can abort by going to Task Manager and killing the UD_7xxxxxxx.exe process. It will send the file back and fetch a new one, and if there is a new version of the science.exe, it will be sent with it. I believe I have seen that credit is still given, but the file has to be sent back by the agent. Any other way of killing might not do that: the agent may simply fetch a new one, and WCG records the cancelled WU as a "no reply". That's waste.

Reports have been requested: post the device number and the (approximate) time, in UTC or local time + zone, at which the WU was downloaded and aborted. Then WCG tech can do a detailed analysis of the aborted WU to further tweak the non-convergence loop exits.

In BOINC, it's much easier to see. The work tab clocks CPU time, not wall-clock time. If that is not progressing for a while, the WU is potentially stuck, but even then I've had ZAs that sat with no CPU time progress for a long while, several hours, with NO OTHER PROCESSES HOGGING THE CPU, and after that just finished.
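
A minimal sketch of that check, assuming you note down (wall-clock seconds, CPU seconds) pairs by hand from the work tab. The function `stall_seconds` and the sample values are hypothetical, and this does not use any BOINC API. Given the caveat above, a long stall is only a hint, not proof that the unit is stuck.

```python
def stall_seconds(samples):
    """Return how long the most recent CPU time value has stayed unchanged.

    samples: list of (wall_clock_seconds, cpu_seconds) in chronological order.
    """
    if not samples:
        return 0.0
    last_wall, last_cpu = samples[-1]
    stall_start = last_wall
    # Walk backwards while the CPU time is the same as in the latest sample.
    for wall, cpu in reversed(samples):
        if cpu != last_cpu:
            break
        stall_start = wall
    return last_wall - stall_start

# Hypothetical readings taken one hour apart: CPU time stops advancing after the second.
samples = [(0, 0), (3600, 3500), (7200, 3500), (10800, 3500)]
print(stall_seconds(samples) / 3600, "hours without CPU time progress")
```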

So far my log shows 22 HPF2 WUs calculated through: 2 invalid, 20 valid. I've cancelled 6 or 7 that were in the queue with v 5.06 in BOINC, to force receipt of new HPF2 WUs with version 5.07 (*). I think the equivalent in the UD agent is something like 5.05.02 vs 5.05.03. You wonder about logs... yes, another very nice feature of BOINC.

* Long story, but any cancelling of 5.06 WUs before they started may have caused excess copies to be crunched in the 'inconclusive' saga. (See Knreed's explanations of why up to 12 copies could get sent out for a single WU.)

** I don't think the percent bar in the 'i' screen skips back to zero for each segment; I've never observed that. If it does, then obviously there are 1000 x 0.1% progression steps visible in a segment. As a suggestion, since I think WCG knows up front how many segments there are in an HPF1/2 WU, they could put the number in the 'i' page and show e.g. 'segment 12 of 25'. Together with a 1000 x 0.1% segment indication on the 'i' progress bar, that gives a much greater granularity.
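
To illustrate the suggested display, here is a sketch that turns an overall percentage into a 'segment N of 25' label plus progress within that segment. The function `segment_progress` is hypothetical, and it assumes all segments count equally toward the overall percentage, which the agent may not actually do.

```python
def segment_progress(overall_percent: float, segments: int = 25) -> str:
    """Format overall progress as 'segment N of M' plus percent within that segment."""
    per_segment = 100 / segments
    # Number of fully completed segments, capped so that 100% reports the last segment.
    completed = min(int(overall_percent // per_segment), segments - 1)
    within = (overall_percent - completed * per_segment) / per_segment * 100
    return f"segment {completed + 1} of {segments}, {within:.1f}%"

# Example: the 62.5% overall figure mentioned earlier in this thread.
print(segment_progress(62.5))
```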

*** One explanation of why it takes much longer was described elsewhere as 'hi-resolution': versus HPF1, the nitty-gritty detail looked for is much deeper than ever before, down to the atomic level.

Just my 2 pennies of how I translate it for myself. Anyone is free to skip, amplify or correct. Not interested in comments that start off with 'bad advice....'

time for coffee

ciao
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Jul 7, 2006 9:54:13 AM]