Thread Status: Active
Total posts in this thread: 109
Posts: 109   Pages: 11   [ Previous Page | 1 2 3 4 5 6 7 8 9 10 | Next Page ]
This topic has been viewed 734816 times and has 108 replies
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

The techs have one WU with which they're actually able to replicate a failure on their lab machines, sometimes. If they had been able to determine the root cause and fix it without affecting the science result, they would have done so long ago.

Anyway, this was the last official reply: http://www.worldcommunitygrid.org/forums/wcg/...ead,27739_offset,0#258519
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Jan 31, 2010 9:36:14 AM]
[Jan 31, 2010 9:35:45 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Hello WCG.

Attention: uplinger -- WCG Tech
cross-reference: my [Dec 28, 2009 6:34:05 PM] post

From the latest batch of WCG WUs I uploaded to WCG (about 12.5 hours ago), only one WU had an error (129 other mixed-project WCG WUs were trouble-free): an HPF2 project WU, "nb947_00022_4", with details as follows:

Boinc_v6.2.28 display
----------------------
Name: nb947_00022_4
CPU time: 03:15:41
Progress: 100%
Report deadline: 2010.02.01.Mon 23:27:37
Status: Computation error

Snippets from "stdoutdae.txt":
------------------------------
28-Jan-2010 23:38:56 [World Community Grid] Starting nb947_00022_4
28-Jan-2010 23:38:56 [World Community Grid] Starting task nb947_00022_4 using hpf2 version 603
29-Jan-2010 03:02:23 [World Community Grid] Computation for task nb947_00022_4 finished
29-Jan-2010 03:02:23 [World Community Grid] Output file nb947_00022_4_0 for task nb947_00022_4 absent
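
For anyone who wants to scan a longer stdoutdae.txt for the same pattern, here is a minimal sketch in Python. It assumes every line looks like the four quoted above (timestamp, project name in brackets, message) and that the month abbreviation is English; the function and variable names are my own and not part of BOINC.

```python
import re
from datetime import datetime

# Sample lines copied from the stdoutdae.txt snippets above.
SAMPLE = [
    "28-Jan-2010 23:38:56 [World Community Grid] Starting nb947_00022_4",
    "28-Jan-2010 23:38:56 [World Community Grid] Starting task nb947_00022_4 using hpf2 version 603",
    "29-Jan-2010 03:02:23 [World Community Grid] Computation for task nb947_00022_4 finished",
    "29-Jan-2010 03:02:23 [World Community Grid] Output file nb947_00022_4_0 for task nb947_00022_4 absent",
]

# Assumed line layout: "<dd-Mon-yyyy hh:mm:ss> [<project>] <message>"
LINE_RE = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[[^\]]+\] (.*)$")
START_RE = re.compile(r"^Starting task (\S+) using ")
FINISH_RE = re.compile(r"^Computation for task (\S+) finished$")
ABSENT_RE = re.compile(r"^Output file (\S+) for task (\S+) absent$")


def summarize(lines):
    """Return {task: {"start", "finish", "absent"}} built from log lines."""
    tasks = {}

    def record(name):
        return tasks.setdefault(name, {"start": None, "finish": None, "absent": []})

    for raw in lines:
        line = LINE_RE.match(raw.strip())
        if line is None:
            continue  # skip anything that is not in the assumed layout
        stamp = datetime.strptime(line.group(1), "%d-%b-%Y %H:%M:%S")
        message = line.group(2)

        started = START_RE.match(message)
        finished = FINISH_RE.match(message)
        absent = ABSENT_RE.match(message)
        if started:
            record(started.group(1))["start"] = stamp
        elif finished:
            record(finished.group(1))["finish"] = stamp
        elif absent:
            # group(1) is the missing output file, group(2) the task name
            record(absent.group(2))["absent"].append(absent.group(1))
    return tasks


if __name__ == "__main__":
    for task, info in summarize(SAMPLE).items():
        if info["absent"]:
            print(task, "wall time:", info["finish"] - info["start"],
                  "missing:", ", ".join(info["absent"]))
```

Run against the four lines above, this flags nb947_00022_4 and shows the wall-clock time between start and finish, which can be compared with the CPU time BOINC displays.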

Others:
--------
OS: 32-bit Vista Ultimate; SP2

The earlier post I made (Dec 28, 2009) was about some HPF2 WUs whose progress was stuck/frozen, or which otherwise consumed unusually long crunch times (~30 hrs; at that time, I opted to abort them). Would the codebase of those HPF2 WUs be the same as that of the HPF2 WUs that (sometimes) exhibit the above-mentioned error? If so, could there be some connection between a suspect HPF2 WU 'getting lost' in a non-convergence zone at one extreme (exhibiting stuck/frozen progress) and, at the other extreme, the WU somehow getting 'trapped' in the non-convergence (which triggers an error)?

Good day.
[Jan 31, 2010 3:29:39 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

I have yet to hear of an HPF2 job that was stuck in an endless loop and then failed to complete successfully after a restart [see FAQs]. How this and the error phenomenon could be connected is hard to see.

1. The outright failures occur within minutes of the start, long before the first checkpoint.

2. The endless looping can happen at any point in time, very possibly in the end zones just before the checkpoints. No one has yet made a checkpoint connection that I can recollect. It's an "I just happened to look" discovery; I run the RosettaView ** utility on the side, since it monitors % progress on jobs and gives off alerts.

** No current source known for download.
[Jan 31, 2010 3:49:29 PM]
joeperry39@gmail.com
Advanced Cruncher
USA
Joined: Nov 22, 2006
Post Count: 140
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Runs OK on my Vista32 and on my XP32 but not so good on my Vista64. I have not tried Win7 yet so have nothing to offer.

My (osugrad) original reply on this subject:

I'm currently running HPF2 on an AMD Athlon II X2 235e Processor running at 2.70GHz with 6GB RAM and 64-bit Windows 7.

So far, no problems. Life is good!

BTW: I'm running BOINC 6.10.18 if that makes a difference.

I'm also running HPF2 on my older machine with an AMD Athlon XP 2400+ processor running at 2 GHz with 2 GB of RAM and 32-bit XP Home, SP-3. HPF2 has been running exclusively on both machines for quite some time now with absolutely No Errors returned. I also have BOINC rel 6.10.18 on the older computer.

I'm no expert on such matters, but can't help but wonder if it's somehow a combination of OS (and perhaps the version thereof), processor make and model, version of BOINC running the jobs and possibly other software that may be running on the various machines at the same time HPF2 is running.
----------------------------------------


"Everything in moderation, including moderation" -- Mark Twain
[Jan 31, 2010 5:23:47 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Anyway, I seem to recall having that problem with HCMD2 on vista and 7 ...
It happens with more than just the HCMD2 task on vista/7?
Or, am I misremembering that, also?
Zoso, I don't know if you are misremembering or miskeying but HCMD2 has never shown any problem similar to these naughty HPF2 problems.
There have been a few teething problems at the beginning of the project and, later, several WUs have looked like they were stuck while computing very tough positions but we have not had any real case of looping yet. And no cases of failures right at the beginning of WUs either.
In fact, HCMD2 is a rather quiet project, if you except the high variability of durations, which may sometimes disturb BOINC's ability to schedule jobs properly.

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Feb 1, 2010 2:46:34 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Hello WCG.

The 'error phenomenon' surrounding WCG's HPF2 WUs seems to have, as I see it, affected a number of WCG crunchers. I thought I would provide some feedback to the WCG community on the crunching of HPF2 WUs on my machine, in the hope that some useful data may be extracted from it to serve as signposts in the search for a solution. Thus...

I just had one HPF2 WU (nc519_00084_13) whose progress was stuck for some time (since when, I don't know), and when I restarted my BOINC_v6.2.28, the said WU resumed incrementing its progress. My image-capture of the said WU in BOINC shows:
-- 04:54:42 and counting up (CPU time)
-- 46.125% and stuck (Progress)
-- 04:42:02 and counting up unevenly/irregularly (To completion)

After waiting for about 30 minutes, with the progress still stuck at 46.125%, I opted to restart BOINC. Some minutes after that BOINC restart, the results were as follows:
-- 02:01:49 and counting up (CPU time)
-- 51.67% and counting up (Progress)
-- 02:39:43 and counting down (To completion)

Finally, the said HPF2 WU completed error-free with a BOINC-indicated CPU time of -- 04:07:56.
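
The manual check described above (watch the progress figure, and restart BOINC if it has not moved for about 30 minutes) can be sketched as a small watchdog. This is only an illustration under my own assumptions: the class name, the 30-minute threshold and the way samples are fed in are hypothetical, and it is not how BOINC or RosettaView detect stalls.

```python
class StallDetector:
    """Flags a task whose reported progress has not changed for too long."""

    def __init__(self, threshold_s):
        self.threshold_s = threshold_s   # how long without change counts as a stall
        self.last_progress = None        # last progress value seen (percent)
        self.last_change_t = None        # time (seconds) of the last change

    def update(self, t, progress):
        """Feed one sample; return True if the task looks stalled at time t."""
        if self.last_progress is None or progress != self.last_progress:
            # First sample, or progress moved: remember it and reset the clock.
            self.last_progress = progress
            self.last_change_t = t
            return False
        return (t - self.last_change_t) >= self.threshold_s


if __name__ == "__main__":
    detector = StallDetector(threshold_s=30 * 60)
    # (seconds since first look, progress %) -- values taken from the report above
    samples = [(0, 46.125), (900, 46.125), (1800, 46.125), (2400, 51.67)]
    for t, progress in samples:
        print(t, progress, "STALLED" if detector.update(t, progress) else "ok")
```

With these samples the detector reports a stall only at the 30-minute mark and clears again once the progress moves to 51.67%.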

Good day.
[Feb 1, 2010 8:41:59 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

[snip]
I'm no expert on such matters, but can't help but wonder if it's somehow a combination of OS (and perhaps the version thereof), processor make and model, version of BOINC running the jobs and possibly other software that may be running on the various machines at the same time HPF2 is running.


That's why I posted the beginning string of Messages from when BOINC last started up before the error; if reports don't include those data, it just takes longer for the pattern (if there even is one) to reveal itself. It would take 1500 or so computers to test all the permutations of just 6 types of CPU, 5 manufacturers of motherboard/chipsets, 5 types of video and 10 different OS's (there are certainly more of all those variables), so it would be difficult, at best, to test every possible hardware combination in alpha OR beta before releasing the task WUs for public crunching.
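
The 1500 figure is just the product of the four counts. A quick sketch of the arithmetic (the dictionary keys are labels I made up for the four variables named above):

```python
from math import prod

# Counts quoted in the post; the key names are only labels for this sketch.
variants = {
    "cpu_types": 6,
    "motherboard_chipset_makers": 5,
    "video_types": 5,
    "operating_systems": 10,
}

# One test machine per combination of one value from each variable.
total = prod(variants.values())

if __name__ == "__main__":
    print(total)
```

Every additional variable multiplies the total again, which is why exhaustive hardware testing gets out of hand so quickly.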

Unless I see another one that errors out by the end of this month, I'll have to agree that it was an AV issue. That's the only Win7 box I have and I'm not planning on paying $300 for Win7 Ultimate when it starts shutting down a month from now... I spent just over $300 assembling that machine (mostly used, off ebay; keyboard and USB hub new from amazon).

6.2.28 is still the 'official' Windows version (with WCG customizations)... as of this second, anyway, though 6.10.xx has been in testing, if you check this thread in the agent forum. I'm running 6.10.25 on 2 Fedora boxes (will probably put it on another one this week), but still have 6.2.28 on my Windows machines.
[Feb 1, 2010 10:13:01 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

ZoSo, we've been battling with a numb ax on this for longer... add the number of concurrent HPF2 jobs to the variations v.v. multicores :-|

Quite a few would like to see an option "Never send me this science" in combination with "Send me something else if you don't have my fav project". BUT, the techs are working on a process that will remember if a client has continuous problems with a specific science and will then only send 1 periodically to check whether the issue was fixed, if asked for of course.
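
To make the idea concrete, here is a rough sketch of what such a "remember and probe" rule could look like. How the techs will actually implement it is not described anywhere in this thread, so everything here is an assumption for illustration: the class, the error limit and the probe interval are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScienceRecord:
    consecutive_errors: int = 0     # errors in a row for this host + science
    requests_since_probe: int = 0   # requests declined since the last probe WU


class ProbePolicy:
    """Hypothetical rule: throttle a science on a host that keeps failing it."""

    def __init__(self, error_limit=3, probe_every=10):
        self.error_limit = error_limit   # errors in a row before throttling
        self.probe_every = probe_every   # send one probe WU every N requests
        self.records = {}                # (host, science) -> ScienceRecord

    def _record(self, host, science):
        return self.records.setdefault((host, science), ScienceRecord())

    def report(self, host, science, ok):
        """Record the outcome of a returned WU."""
        rec = self._record(host, science)
        if ok:
            rec.consecutive_errors = 0
            rec.requests_since_probe = 0
        else:
            rec.consecutive_errors += 1

    def should_send(self, host, science):
        """Decide whether to send this science to this host right now."""
        rec = self._record(host, science)
        if rec.consecutive_errors < self.error_limit:
            return True                  # no continuous problem on record
        rec.requests_since_probe += 1
        if rec.requests_since_probe >= self.probe_every:
            rec.requests_since_probe = 0
            return True                  # periodic probe to see if it is fixed
        return False
```

A single good result resets the record, so a host that recovers goes straight back to normal scheduling.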

edit: inserted continuous
[Edit 1 times, last edit by Sekerob at Feb 2, 2010 10:03:11 AM]
[Feb 1, 2010 10:59:09 AM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Hello all,

The error rate for the windows platform on this application does have our attention. We are currently working very hard to bring two more science applications online before we are able to dedicate more of our time to fixing this issue. This error is different than most we have seen in the past and requires more dedicated time debugging it than usual. Please be patient with us and we will fix this issue.

Thanks,
-Uplinger
[Feb 1, 2010 9:04:35 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Hello WCG.

Reference:
-- uplinger's [Feb 1, 2010 9:04:35 PM] post
-- Sekerob's [Feb 1, 2010 10:59:09 AM] post

Gentlemen:

I commend your efforts in dealing with the HPF2 issue. Sekerob's idea proposes to address those concerns that emphasize getting points off crunching WUs while uplinger's idea addresses the nuts-and-bolts of HPF2 WU processing itself with a view of hunting down the source of the issue and with that, proposes to address concerns that emphasize the importance of HPF2 WUs; that is, for crunchers who decided to stick with HPF2 WUs (because of the importance of the underlying science) despite the relatively few errors that may arise crunching them.

P.S. As of this hour, I have finished crunching 14 WCG HPF2 WUs, averaging 243 minutes per HPF2 WU. No HPF2 problems thus far (since my last report).

Good day.
[Feb 2, 2010 9:37:38 AM]