World Community Grid - View Thread - You have to be kidding me...

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: You have to be kidding me...

Thread Status: Active
Total posts in this thread: 19

This topic has been viewed 1601 times and has 18 replies

jonnieb-uk
Ace Cruncher
England
Joined: Nov 30, 2011
Post Count: 6105
Status: Offline

Re: You have to be kidding me...

After a work unit is in 'error' status a computer has its 'trusted' status removed and its work units must be checked by a wingman. Any subsequent work units returned will be put in Pver status for this check. Further work units sent to the computer will also be sent to a wingman. This will continue until 'trusted' status is achieved again.

And the WUs sent out for checking by a wingman have a 10 day deadline!

Given the increased incidence of Error and P/Ver when crunching CEP2, can the techs reduce this to the standard 3-day deadline for repair work?

Seems the repair deadline has been changed to 3.5 days :)

----------------------------------------

To Join follow this link: Join the UK Team All Welcome! UK Team thread

----------------------------------------
[Edit 1 times, last edit by jonnieb-uk at Aug 4, 2014 12:25:15 PM]

[Aug 4, 2014 11:38:42 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: You have to be kidding me...

'Trusted' or 'reliable' status is tracked at the science app level and is maintained by always having the last 20 or more results consecutively rated valid. This includes results from before a problem occurred that had been waiting on a wingman.
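To make that rule concrete, here is a minimal sketch of a per-application streak counter. The class name `HostReliability`, the exact threshold of 20, and the reset on any non-valid result are assumptions for illustration only; the real server-side logic is not shown anywhere in this thread.

```python
from dataclasses import dataclass, field

REQUIRED_STREAK = 20  # assumed from "the last 20+" mentioned in this post


@dataclass
class HostReliability:
    """Hypothetical record of consecutive valid results per science app."""
    streak: dict = field(default_factory=dict)

    def record(self, app: str, outcome: str) -> None:
        # Assumption: any outcome other than "valid" resets the streak.
        if outcome == "valid":
            self.streak[app] = self.streak.get(app, 0) + 1
        else:
            self.streak[app] = 0

    def is_trusted(self, app: str) -> bool:
        return self.streak.get(app, 0) >= REQUIRED_STREAK


host = HostReliability()
for _ in range(25):
    host.record("cep2", "valid")
trusted_before = host.is_trusted("cep2")       # 25 valid results in a row

host.record("cep2", "error")
trusted_after_error = host.is_trusted("cep2")  # streak was reset by the error

for _ in range(20):
    host.record("cep2", "valid")
trusted_again = host.is_trusted("cep2")        # 20 valid results since the error

print(trusted_before, trusted_after_error, trusted_again)
```

Under these assumptions a single error costs the host 20 further valid results before its work stops needing a wingman.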

As for the comment about results that had wrongly gone to error and then went valid after re-validation: all of those would have counted against the 20. Sadly, though, copies that had already been sent out up front to a wingman still go to waste in this respect, as there is no retroactive reset. How many days, months or years' worth of computing time went in the bin this way is the jackpot question.

Repairs have for a long time been at 35 percent of the original deadline, as posted, probably by keithing reed, a former technician, like here. The 30 percent setting was only brief. And still today we're waiting for repairs to get at least the same deadline date as the original. At the ministry of silly walks, repairs are most of the time due before the original. With an initial distribution of 2, you can have one copy with a 10 day deadline and the repair, sent out the next day after a wingman failure, due in 3.5 days. The net result is that the repairs are very often left waiting on the original.
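The arithmetic behind that last point can be sketched as follows. The 10 day deadline and the 35 percent figure come from this thread; the send times and the helper `due_date` are made up for illustration.

```python
from datetime import datetime, timedelta

ORIGINAL_DEADLINE_DAYS = 10  # from the thread
REPAIR_PERCENT = 35          # "35 percent of the original deadline"


def due_date(sent: datetime, repair: bool) -> datetime:
    """Return the due date for an original copy or a repair copy."""
    days = ORIGINAL_DEADLINE_DAYS
    if repair:
        days = ORIGINAL_DEADLINE_DAYS * REPAIR_PERCENT / 100  # 3.5 days
    return sent + timedelta(days=days)


# Hypothetical send times: the repair goes out one day after the original.
original_sent = datetime(2014, 8, 1, 12, 0)
repair_sent = original_sent + timedelta(days=1)

original_due = due_date(original_sent, repair=False)
repair_due = due_date(repair_sent, repair=True)
gap = original_due - repair_due

print("original due:", original_due)
print("repair due:  ", repair_due)
print("repair may wait on the original for up to:", gap)
```

With these numbers the repair is due 5.5 days before the remaining original copy, so a prompt repair result can still sit waiting for its partner.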

[Aug 4, 2014 3:15:39 PM]

Thyme Lawn
Cruncher
Joined: Dec 9, 2008
Post Count: 46
Status: Offline

Re: You have to be kidding me...

I've had a series of E224* tasks which have failed with "RC = 0xc0000005" in job 1 and skipped jobs 2 to 15.

[22:26:44] Number of jobs = 16
[22:26:44] Starting job 0,CPU time has been restored to 0.000000.
[23:18:12] Finished Job #0
[23:18:12] Starting job 1,CPU time has been restored to 805.500000.
Application exited with RC = 0xc0000005
[01:42:33] Finished Job #1
[01:42:33] Starting job 2,CPU time has been restored to 4117.640625.
[01:42:33] Skipping Job #2

I returned one of these tasks at 06:50:43 on 1st August which is PV, and tasks with the same processing pattern returned earlier than that were being validated.

That seems to have changed since the validator was modified. I've returned 2 similarly afflicted tasks today which were both marked as error and have just downloaded an E224*_6 task which, based on the 6 preceding failures, I'm sure will go the same way.

If the change is due to the validator update I guess the wingman for my PV task will be marked as an error after it's reported.
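For anyone wanting to spot this pattern across many results, here is a rough sketch that scans a stderr log for the failing job and the skipped jobs. The sample text is adapted from the excerpt in this post; the function name `summarize` and the regular expressions are my own assumptions about the log layout, not anything published by the project.

```python
import re

LOG_EXCERPT = """\
[22:26:44] Number of jobs = 16
[22:26:44] Starting job 0,CPU time has been restored to 0.000000.
[23:18:12] Finished Job #0
[23:18:12] Starting job 1,CPU time has been restored to 805.500000.
Application exited with RC = 0xc0000005
[01:42:33] Finished Job #1
[01:42:33] Starting job 2,CPU time has been restored to 4117.640625.
[01:42:33] Skipping Job #2
"""

START_RE = re.compile(r"Starting job (\d+)")
RC_RE = re.compile(r"Application exited with RC = (0x[0-9a-fA-F]+)")
SKIP_RE = re.compile(r"Skipping Job #(\d+)")


def summarize(log: str) -> dict:
    """Collect the return code per failing job and the list of skipped jobs."""
    current_job = None
    failures = {}  # job number -> return code text
    skipped = []
    for line in log.splitlines():
        # Plain ifs (not elif) so a line carrying both a job start and an
        # RC message is still handled.
        if (m := START_RE.search(line)):
            current_job = int(m.group(1))
        if (m := RC_RE.search(line)):
            failures[current_job] = m.group(1)
        if (m := SKIP_RE.search(line)):
            skipped.append(int(m.group(1)))
    return {"failures": failures, "skipped": skipped}


summary = summarize(LOG_EXCERPT)
print(summary)
```

Run against a full log, `summary["skipped"]` would list every job after the crash, which is the signature described above.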

----------------------------------------

"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

[Aug 4, 2014 9:12:57 PM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline

Re: You have to be kidding me...

I have lowered the total number of errors allowed for CEP2. We should not have 9 copies sent out again.

Thanks,
-Uplinger

[Aug 6, 2014 1:53:20 AM]

cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline

Re: You have to be kidding me...

I have two more CEP2 WUs that seem to be kaput. One has already plowed through its 10 victims and the other one is getting started:
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 8-- 640 Error 8/5/14 10:00:09 8/5/14 15:17:57 0.90 30.7 / 30.7
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 9-- 640 Error 8/5/14 09:55:31 8/5/14 15:38:55 2.14 31.0 / 31.0
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 7-- 640 Error 8/4/14 07:59:10 8/5/14 08:10:03 0.85 25.2 / 25.2
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 6-- 640 Error 8/4/14 07:45:53 8/5/14 04:37:51 1.03 53.0 / 53.0
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 5-- 640 Error 8/3/14 08:53:06 8/4/14 07:37:20 1.01 31.9 / 31.9
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 4-- 640 Error 8/3/14 08:52:38 8/3/14 13:12:26 0.69 24.5 / 24.5
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 3-- 640 Error 8/2/14 20:21:55 8/3/14 08:43:59 1.51 27.3 / 27.3 <-me
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 2-- 640 Error 8/2/14 14:12:09 8/2/14 16:03:59 0.89 30.5 / 30.5
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 1-- 640 Error 8/2/14 14:02:35 8/3/14 02:33:12 0.78 37.6 / 37.6
E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 0-- 640 Error 8/1/14 15:39:15 8/2/14 14:00:08 1.15 44.1 / 44.1

E224265_ 958_ I.68.C54H28N4O10.00241615.1.set1d06_ 4-- - In Progress 8/5/14 23:16:59 8/15/14 23:16:59 0.00 0.0 / 0.0
E224265_ 958_ I.68.C54H28N4O10.00241615.1.set1d06_ 3-- 640 Error 8/5/14 18:37:46 8/5/14 23:11:10 0.00 0.0 / 0.0
E224265_ 958_ I.68.C54H28N4O10.00241615.1.set1d06_ 2-- - In Progress 8/5/14 18:31:19 8/15/14 18:31:19 0.00 0.0 / 0.0
E224265_ 958_ I.68.C54H28N4O10.00241615.1.set1d06_ 1-- 640 Error 8/1/14 15:49:16 8/1/14 19:45:36 1.58 28.4 / 0.0 <-me
E224265_ 958_ I.68.C54H28N4O10.00241615.1.set1d06_ 0-- 640 Error 8/1/14 15:40:00 8/5/14 15:35:34 1.10 38.3 / 0.0

I don't see any errors in the Results Log:

Result Log

Result Name: E224268_ 617_ I.68.C54H29N5O9.00404372.1.set1d06_ 3--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[00:58:50] Number of jobs = 16
[00:58:50] Starting job 0,CPU time has been restored to 0.000000.
[01:20:51] Finished Job #0
[01:20:51] Starting job 1,CPU time has been restored to 1141.896120.
Application exited with RC = 0xc0000005
[02:42:45] Finished Job #1
[02:42:45] Starting job 2,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #2
[02:42:45] Starting job 3,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #3
[02:42:45] Starting job 4,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #4
[02:42:45] Starting job 5,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #5
[02:42:45] Starting job 6,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #6
[02:42:45] Starting job 7,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #7
[02:42:45] Starting job 8,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #8
[02:42:45] Starting job 9,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #9
[02:42:45] Starting job 10,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #10
[02:42:45] Starting job 11,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #11
[02:42:45] Starting job 12,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #12
[02:42:45] Starting job 13,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #13
[02:42:45] Starting job 14,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #14
[02:42:45] Starting job 15,CPU time has been restored to 5430.597611.
[02:42:45] Skipping Job #15
02:42:49 (2564): called boinc_finish

</stderr_txt>
]]>

Posted all the above if it's any help to anybody.

CJSL

Crunching for a brighter tomorrow...

----------------------------------------

I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team

[Aug 6, 2014 2:09:27 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: You have to be kidding me...

I got one. I'm the tenth to run it; all the others erred.

It only ran one job and skipped the rest, see below.

E224270_ 215_ I.66.C55H35N5O6.00310371.1.set1d06_ 9-- 640 Pending Validation 8/6/14 03:30:43 8/6/14 05:37:19 0.71 37.4 / 0.0

Result Log

Result Name: E224270_ 215_ I.66.C55H35N5O6.00310371.1.set1d06_ 9--
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[14:27:57] Number of jobs = 16
[14:27:57] Starting job 0,CPU time has been restored to 0.000000.
[14:28:00] Starting new Job
[14:28:01] Qink name = fldman
[14:28:02] Qink name = gesman
[14:28:02] Qink name = scfman
[14:41:02] Qink name = anlman
[14:41:20] End of Job
[14:41:21] Finished Job #0
[14:41:21] Starting job 1,CPU time has been restored to 738.748000.
[14:41:21] Starting new Job
[14:41:21] Qink name = fldman
[14:41:24] Qink name = gesman
[14:41:25] Qink name = scfman
[15:03:11] Qink name = anlman
Application exited with RC = 0x8b
[15:12:11] Finished Job #1
[15:12:11] Starting job 2,CPU time has been restored to 2478.512000.
[15:12:11] Skipping Job #2
[15:12:11] Starting job 3,CPU time has been restored to 2478.512000.
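
The two return codes seen in this thread look different but may describe the same kind of crash. 0xc0000005 is the Windows status code for an access violation, and 0x8b is 139 in decimal, which on a Unix-like host would match the common "128 + signal number" convention with signal 11 (SIGSEGV). Whether the CEP2 application reports its return code that way is an assumption on my part, so treat this sketch as a reading aid only; `describe_rc` and the two lookup tables are illustrative names.

```python
# NTSTATUS values as reported on Windows; 0xC0000005 is STATUS_ACCESS_VIOLATION.
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

# Signal numbers as used on Linux; 11 is SIGSEGV.
SIGNAL_NAMES = {11: "SIGSEGV"}


def describe_rc(rc_text: str) -> str:
    """Give a best-guess reading of an 'RC = 0x...' value from a result log."""
    rc = int(rc_text, 16)
    if rc in NTSTATUS_NAMES:
        return f"Windows {NTSTATUS_NAMES[rc]}"
    # Assumption: values between 129 and 255 follow the shell convention
    # of 128 + signal number.
    if 128 < rc < 256:
        signum = rc - 128
        return f"signal {signum} ({SIGNAL_NAMES.get(signum, 'unknown')})"
    return f"unrecognised return code {rc}"


for text in ("0xc0000005", "0x8b"):
    print(text, "->", describe_rc(text))
```

If that reading is right, both logs show the application hitting an invalid memory access in job 1, on Windows and on a Unix-like system respectively.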

[Aug 6, 2014 5:47:39 AM]

jonnieb-uk
Ace Cruncher
England
Joined: Nov 30, 2011
Post Count: 6105
Status: Offline

Re: You have to be kidding me...

It seems that the deadline for CEP2 repair work has been moved back out to 10 days rather than 35% of the original deadline :/

E225052_ 949_ S.252.C31H23N5O1.XKLYIVBOTGFSMG-UHFFFAOYSA-N.4_ s1_ 14_ 1-- - In Progress 06/08/14 21:25:13 16/08/14 21:25:13 0.00 0.0 / 0.0
E225052_ 949_ S.252.C31H23N5O1.XKLYIVBOTGFSMG-UHFFFAOYSA-N.4_ s1_ 14_ 0-- 640 Pending Verification 05/08/14 20:46:52 06/08/14 21:15:07 6.72 261.0 / 0.0


[Aug 6, 2014 10:58:57 PM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline

Re: You have to be kidding me...

jonnieb,

You are correct. The jobs being sent out right now do not have the reliable setting applied to them. We are still working through member computers that lost their reliable status because of the validation issues last week. This is what was causing CEP2 to appear out of work when it actually wasn't. I thought we had cleared them the other day, but then we got bitten by them again, which is why users were seeing no work available. I'm going to let this run for a bit, hopefully to get more reliable hosts for CEP2.

Thanks,
-Uplinger

[Aug 7, 2014 5:06:18 AM]

jonnieb-uk
Ace Cruncher
England
Joined: Nov 30, 2011
Post Count: 6105
Status: Offline

Re: You have to be kidding me...

Keith,
Thanks for the explanation :)


[Aug 7, 2014 9:10:11 AM]
