Thread Status: Active
Total posts in this thread: 28
This topic has been viewed 2084 times and has 27 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Wingmen error with RC = 0x100, mine errors with normally-valid RC = 0x1

We are aware of the increase in work units failing for CEP2 and are working with the Harvard team to resolve the issue. The problem is that some work units cause a fatal error in the Q-Chem code. Ideally these work units would be identified ahead of time, but if that proves impossible we will make sure this is handled on the validation side. Until a more permanent solution can be found, work units that experience this problem are manually being given credit.

Seippel
(emphasis mine)

What better demonstration can one get of how broken the credit 'system' is... [it's not even wrong]. Compare the light green line in the Average Runtime Per Result chart with the one in the Credit Per Hour chart and draw your own conclusion:

http://bit.ly/WCGART
http://bit.ly/WCGCPH

When that block of time is carved out to fix this... maybe when Earth captures a second moon, or loses it, and it is, sloooowly, by a few centimeters per year, really! [The first per the most novel theory being that Earth nicked its moon from Venus!]
[Oct 1, 2013 9:07:09 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Wingmen error with RC = 0x100, mine errors with normally-valid RC = 0x1

What better demonstration can one get of how broken the credit 'system' is... [it's not even wrong].


Those seem to be rather harsh words for you, Rob?

While everything is not perfect, can we really expect perfection? The CEP2 project does present some particular difficulties for those machines not completing all the sub-jobs. The 12-hour cut-off is going to skew some of the numbers, and when the jobs completed are taken into consideration during validation I'm fairly sure that the jobs are given equal weight, which doesn't seem to reflect either their usefulness to the scientists or the relative amount of processor time needed for each job. But so what? We all have the option of running a different science (even if the choice is a little limited just now).
[Oct 1, 2013 9:48:36 AM]
littlepeaks
Veteran Cruncher
USA
Joined: Apr 28, 2007
Post Count: 748
Status: Offline
Re: Wingmen error with RC = 0x100, mine errors with normally-valid RC = 0x1

I really don't care about the credits -- 1,000,000 credits and $1.50 will get you a cup of coffee. I just wanted to hear that WCG acknowledges that there is a problem and is trying to do something about it (they acknowledged the problem and are trying to mitigate it -- thanks WCG). Also, I feel that we are wasting computer resources, because when one of these WUs errors out, your PC is no longer deemed reliable, and someone has to double up on your WUs to make sure your PC is OK.
[Oct 1, 2013 4:29:38 PM]
seippel
Former World Community Grid Tech
Joined: Apr 16, 2009
Post Count: 392
Status: Offline
Re: Wingmen error with RC = 0x100, mine errors with normally-valid RC = 0x1

We've finished updating the validator to handle these work units, which were previously erroring out, and you should automatically get credit for future work units where this occurs (previous work units have manually had credit given). The problem was that the current library contains one or more molecular building blocks which lead to compounds that are difficult to characterize quantum chemically. Information about which work units experience this will help the scientists learn from it for the future.

Seippel
[Oct 1, 2013 9:49:06 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Wingmen error with RC = 0x100, mine errors with normally-valid RC = 0x1

Thanks, Al.
[Oct 1, 2013 10:25:35 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Wingmen error with RC = 0x100, mine errors with normally-valid RC = 0x1

I've been dumping any WUs that I get & see have errored out with RC=0x01 for other wingmen. I've just about worn the paint off the "Abort" buttons in my BOINC Managers.
It doesn't save much wasted CPU time since they would have errored out very early in Job #0, but it helps get them through the WCG system sooner. It also puts more hours of CEP2 work into my WU caches and helps offset the limits imposed by the 16-WU max setting in Device Manager. Doing this does not seem to upset my "reliable device" status.

I don't know why the aborted WUs show in the Results Status pages as "Error", not "User Aborted", while WUs from wingmen are sometimes shown with that status.

The error rate for incoming WUs does seem to have decreased, but I've now had a couple of very short WUs that crunch all 16 sub-jobs in about 30 min. Example: E215713_072_I.10.C7H3N3.00008305.0.set1d06
These may have been coming for a while and I hadn't noticed. I expect the techs and/or scientists have filters to flag such results as they come through.
[Oct 3, 2013 4:50:14 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Wingmen error with RC = 0x100, mine errors with normally-valid RC = 0x1

Well Rickjb, visit the Result Status page and filter on 'Error'. When you've overcome the 'huh', filter on 'User Aborted' and 'Server Aborted'. Then it's just 'Oh'.

The fact of the matter is, client 7 sends back codes 202 and 203 for these cases. Those were new for Server 700. There's a patch for that, but there are probably [major guess] other legacy things that need ripping out for that to apply properly and let users with v7 get the 'what you see is...' correctly. Revert to client 6, and you can enjoy the right text to go with the old codes for server/user aborted. I'm seeing enough wingmen who get it shown properly [and their logs show them having 6.10.58 or earlier].

For now, where's WCG's v7 client? Would it fix this issue? I doubt it, because I ran the last WCG 7.2.7 test build and aborts still showed as errors. At present I have zero expectations; one or two surprises continue to be held back, maybe something by the winter season, when unspecified things were to happen to the website [read that somewhere], but then knreed wrote a few days ago "the site redesign work we are starting". Someone commented on that as being "Very Interesting".

edit [3]: BTW, I have neither the time nor the intention to MM this situation. The high incidence rate leads to, or possibly has already led to, most devices losing their 'reliable' status for CEP2. I just had one go out after 2 minutes; the next was fetched and it's listed as having an up-front wingman. Then I looked at the other 6 on the RS list that are In Progress. Except for one device doing 1 every 2 days or so, all have an up-front wingman... unreliable. And no, aborts [before starting] don't impact reliability, just the daily science WU quota, whatever it is for CEP2.
----------------------------------------
[Edit 3 times, last edit by Former Member at Oct 3, 2013 6:20:13 PM]
[Oct 3, 2013 5:15:59 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Wingmen error with RC = 0x100, mine errors with normally-valid RC = 0x1

When we release the website changes for the upcoming project release, the Result Status page will then show User Abort and Server Abort properly for the exit 202 and exit 203 status codes.
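
For anyone wondering why an abort shows up as "Error" in the meantime, here is a rough Python sketch of the kind of lookup a status page might do. It is only an illustration, not WCG's actual code: the names `ABORT_LABELS`, `status_label`, `legacy_labels` and `patched_labels` are made up, and which of 202 and 203 means user abort versus server abort is an assumption, since the thread does not spell it out.

```python
# Hypothetical label table. The pairing of each code with a label is an
# assumption for illustration; the thread only says 202 and 203 are the
# abort codes sent back by client 7.
ABORT_LABELS = {
    202: "Server Aborted",  # assumed: work unit aborted by the project
    203: "User Aborted",    # assumed: work unit aborted by the user
}


def status_label(exit_code, known_labels):
    """Return the display label for an exit code, or 'Error' if unknown."""
    return known_labels.get(exit_code, "Error")


# A page that predates codes 202 and 203 has no entry for them.
legacy_labels = {}
# A page updated for the new codes.
patched_labels = dict(ABORT_LABELS)

for code in (202, 203):
    before = status_label(code, legacy_labels)
    after = status_label(code, patched_labels)
    print(code, before, "->", after)
```

With a table that has no entry for 202 or 203, both fall through to the generic "Error" text, which matches what v7 users report seeing; once the entries exist, the same lookup returns the abort labels.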
[Oct 4, 2013 3:43:50 PM]