World Community Grid - View Thread - anyone else seeing these kinds of errors? I'm getting tons of them.

World Community Grid Forums

Category: Completed Research

Forum: Human Proteome Folding - Phase 2

Thread: anyone else seeing these kinds of errors? I'm getting tons of them.

Thread Status: Active
Total posts in this thread: 109

This topic has been viewed 733631 times and has 108 replies

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Your reading of your stats still isn't taking the PV jail numbers into account. By about day 4 you should reach a fairly constant level, but HPF2 is extra special because it is highly susceptible to the Monday-Friday contribution of office machines, on top of the quorum 15 mechanism. Just look at this roller coaster:
http://i137.photobucket.com/albums/q210/Sekerob/WCGHPF2ProdChart.png

and compare that to the project continuity of the others.

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

----------------------------------------
[Edit 1 times, last edit by Sekerob at Mar 4, 2010 6:32:52 PM]

[Mar 4, 2010 6:31:42 PM]

Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline

Re: anyone else seeing these kinds of errors? I'm getting tons of them.

a
aa
aaa
aaaa
aaaaa
aaaaaa
aaaaaaa
aaaaaaaa
aaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...............................

----------------------------------------

[Mar 4, 2010 8:47:03 PM]

rilian
Veteran Cruncher
Ukraine - we rule!
Joined: Jun 17, 2007
Post Count: 1453
Status: Offline

Re: anyone else seeing these kinds of errors? I'm getting tons of them.

One of my hosts started getting random errors today/yesterday

Result Name: ne416_ 00037_ 6--
<core_client_version>6.10.32</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C81A3E1

Engaging BOINC Windows Runtime Debugger...

++ there is a long debug info below

The WU quit after 60 hours of crunching (crying)

Besides this WU, some others quit after anywhere from 0.02 hours up to 7.xx hours with the same error

<core_client_version>6.10.32</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

</stderr_txt>
]]>

ne751_ 00038_ 3-- computername Error 3/5/10 11:12:52 3/5/10 11:15:15 0.02 0.1 / 0.0
ne752_ 00050_ 9-- computername Error 3/5/10 11:12:25 3/5/10 11:15:15 0.02 0.2 / 0.0
ne753_ 00044_ 7-- computername Error 3/5/10 11:12:25 3/5/10 11:15:15 0.02 0.1 / 0.0
ne741_ 00042_ 13-- computername Error 3/5/10 11:12:25 3/5/10 11:15:15 0.02 0.2 / 0.0
ne735_ 00006_ 3-- computername Error 3/5/10 07:00:19 3/5/10 11:12:24 3.84 32.0 / 0.0
ne727_ 00029_ 17-- computername Error 3/5/10 07:00:19 3/5/10 11:12:24 3.37 28.0 / 0.0
ne691_ 00073_ 18-- computername Pending Validation 3/4/10 15:39:23 3/5/10 07:00:19 13.28 110.5 / 0.0
ne691_ 00043_ 2-- computername Error 3/4/10 15:39:23 3/5/10 11:12:24 7.37 61.3 / 0.0
ne691_ 00041_ 10-- computername Error 3/4/10 15:39:23 3/5/10 11:12:24 5.82 48.4 / 0.0

(confused)

I can get the message log from this machine later...

----------------------------------------

Ukraine - Join us! http://distributed.org.ua

----------------------------------------
[Edit 2 times, last edit by rilian at Mar 5, 2010 3:24:00 PM]

[Mar 5, 2010 3:20:25 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

rilian,

Please see my HPF2 forum post of today... BOINCTasks is getting an alert system to warn about tasks stuck in a loop. HPF2 is the only project I know of at WCG that does that, and only rarely.

For now I'm using RosettaView (no longer available on the intertube)

----------------------------------------
[Edit 1 times, last edit by Sekerob at Mar 5, 2010 3:56:09 PM]

[Mar 5, 2010 3:52:55 PM]

rilian
Veteran Cruncher
Ukraine - we rule!
Joined: Jun 17, 2007
Post Count: 1453
Status: Offline

Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Sekerob, thanks, I've seen this post ( http://www.worldcommunitygrid.org/forums/wcg/...24380_lastpage,yes#270466 ) about the BOINCTasks tool.

Unfortunately my machines are quite remote, so even if it warns me about some WU, I couldn't do anything in time.

Is there anything I can do, other than not running HPF2 on that machine?

It is:

GenuineIntel Intel(R) Xeon(TM) CPU 3.00GHz [x86 Family 15 Model 4 Stepping 10]
Microsoft Windows Server 2003
Enterprise Server x86 Edition, Service Pack 2, (05.02.3790.00)

----------------------------------------

[Mar 5, 2010 8:43:46 PM]

robertmiles
Senior Cruncher
US
Joined: Apr 16, 2008
Post Count: 443
Status: Offline

Re: anyone else seeing these kinds of errors? I'm getting tons of them.

I got somewhat similar errors on my laptop for a while, before I decided I had credit for enough workunits of this type for now and switched all three of my computers to another WCG subproject.

A few details about that computer:
64-bit Windows Vista SP2
BOINC 6.10.18
several other BOINC projects, including the GPU type and the full CPU and full GPU type (Einstein)
8 GB memory for 2 CPU cores; BOINC allowed to use only 40% of it due to problems on my other two computers if more is allowed
Keep workunits in memory when suspended turned off, again due to problems on my other two computers
BOINC allowed to use 60% of the CPU time, compared to 100% on the two computers with better results on this subproject
Errors generally occur well after the workunit is started, when it's trying to resume from a checkpoint

The GPU and Einstein workunits tend to suspend themselves whenever I use the keyboard or the touchpad. For Einstein workunits, at least, this lets a CPU-only workunit get a much shorter than usual piece of a timeslot; I suspect that could cause problems for CPU workunits with infrequent checkpoints if BOINC counts those pieces the same as a full timeslot. The GPU workunits resume within minutes after I stop using the keyboard and the touchpad; so do Einstein workunits, even when that requires an early suspension of a CPU-only workunit, much as when some other workunit goes into high-priority mode.

I've never been interested in overclocking enough to find instructions on how to do it, but that laptop is rather hot to put on my lap even with the current settings, and tends to use the high speed of its fan much more often now that I've found some GPU projects compatible with its GPU board (a G105M).

----------------------------------------
[Edit 1 times, last edit by robertmiles at Mar 7, 2010 5:17:05 AM]

[Mar 7, 2010 4:17:50 AM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

I may have mentioned this before, but on my quad W7 64-bit machine with the 64-bit client (6.10.36) I was observing a pattern of HPF2 fails, but exclusively when running in combination with the AutoDock sciences. I first saw a number of 2-minute error-outs and one 50 minutes into the job, all with the same lines in the result log ending in /401, whilst HFCC was running, so I deselected that project, and once none were left in the mix, everything ran happily with RICE, HCC, HCMD2. Then yesterday I had a few FAAH come in and forced 1 to start. Sure enough, whilst it was running, several HPF2 jobs failed in the familiar 2-minute style. That FAAH finished, and since then 4 more were returned without issue, with 2 still in progress and 2 hours under the belt.

So, is anyone else having this particular experience, or can anyone reconstruct this having happened by listing out the Result Status pages or the BOINCTasks history (v 0.45)? A sample:

First set when a FAAH ran:

World Community Grid 6.03 hpf2 nf439_00014 06:35:57 (06:13:36) 17-03-2010 10:10 17-03-2010 10:10 Reported: Ok
World Community Grid 6.06 hcc1 X0000090400045200707131445 04:56:10 (04:46:55) 17-03-2010 09:57 17-03-2010 09:57 Reported: Ok
World Community Grid 6.06 hcc1 X0000090400129200708021915 04:53:47 (04:44:07) 17-03-2010 09:24 17-03-2010 09:24 Reported: Ok
World Community Grid 6.06 hcc1 X0000090400276200707121410 04:30:44 (04:28:19) 17-03-2010 06:31 17-03-2010 06:31 Reported: Ok
World Community Grid 6.06 hcc1 X0000090400316200707121409 04:34:58 (04:32:13) 17-03-2010 05:01 17-03-2010 05:01 Reported: Ok
World Community Grid 6.06 hcc1 X0000090400589200707121404 04:38:46 (04:36:22) 17-03-2010 04:30 17-03-2010 04:31 Reported: Ok
World Community Grid 6.03 hpf2 nf439_00010 05:32:54 (05:31:14) 17-03-2010 03:34 17-03-2010 03:34 Reported: Ok
World Community Grid 6.03 hpf2 nf439_00015 05:48:11 (05:44:52) 17-03-2010 02:00 17-03-2010 02:00 Reported: Ok
World Community Grid 6.03 hpf2 nf406_00064 04:29:17 (04:24:05) 16-03-2010 23:51 16-03-2010 23:52 Reported: Ok
World Community Grid 6.07 faah faah11385_ZINC11800521_xMut_md21780_02 06:24:48 (06:14:08) 16-03-2010 23:42 16-03-2010 23:43 Reported: Ok
World Community Grid 6.03 hpf2 nf439_00011 00:01:16 (00:01:15) 16-03-2010 22:01 16-03-2010 22:02 Reported: Computation error (1,)
World Community Grid 6.06 hcc1 X0000090370235200708021803 04:45:19 (04:38:38) 16-03-2010 22:00 16-03-2010 22:00 Reported: Ok
World Community Grid 6.03 hpf2 nf406_00058 05:25:20 (04:50:20) 16-03-2010 20:12 16-03-2010 20:12 Reported: Ok
World Community Grid 6.03 hpf2 nf405_00046 06:10:32 (05:41:34) 16-03-2010 19:48 16-03-2010 19:48 Reported: Ok
World Community Grid 6.03 hpf2 nf389_00078 05:25:14 (05:11:54) 16-03-2010 17:55 16-03-2010 17:56 Reported: Ok
World Community Grid 6.03 hpf2 nf439_00023 00:01:20 (00:01:12) 16-03-2010 17:18 16-03-2010 17:20 Reported: Computation error (1,)
World Community Grid 6.03 hpf2 nf382_00032 05:33:06 (05:25:04) 16-03-2010 16:36 16-03-2010 16:36 Reported: Ok
World Community Grid 6.06 hcc1 X0000090281140200708021314 04:58:47 (04:42:34) 16-03-2010 13:40 16-03-2010 13:40 Reported: Ok
World Community Grid 6.03 hpf2 nf380_00030 05:12:20 (04:50:29) 16-03-2010 13:36 16-03-2010 13:37 Reported: Ok

Second set when several HFCC ran:

World Community Grid 6.06 hcc1 X0000084650807200703070838 04:35:01 (04:33:19) 10-03-2010 05:54 10-03-2010 05:55 Reported: Ok
World Community Grid 6.03 hpf2 ne861_00046 04:53:04 (04:50:50) 10-03-2010 03:52 10-03-2010 03:52 Reported: Ok
World Community Grid 6.03 hpf2 ne863_00000 05:51:41 (05:49:11) 10-03-2010 01:19 10-03-2010 01:20 Reported: Ok
World Community Grid 6.03 hpf2 ne858_00011 05:07:29 (05:04:20) 09-03-2010 23:00 09-03-2010 23:06 Reported: Ok
World Community Grid 6.03 hpf2 ne858_00042 05:06:34 (05:03:57) 09-03-2010 22:59 09-03-2010 23:06 Reported: Ok
World Community Grid 6.03 hpf2 ne843_00019 05:22:19 (05:16:21) 09-03-2010 22:27 09-03-2010 23:06 Reported: Ok
World Community Grid 6.03 hpf2 ne859_00044 05:47:49 (05:14:02) 09-03-2010 19:28 09-03-2010 19:28 Reported: Ok
World Community Grid 6.06 hcc1 X0000084630459200703161915 05:17:39 (05:01:45) 09-03-2010 17:52 09-03-2010 17:53 Reported: Ok
World Community Grid 6.03 hpf2 ne853_00105 06:44:52 (06:27:13) 09-03-2010 17:52 09-03-2010 17:53 Reported: Ok
World Community Grid 6.06 hcc1 X0000084640008200703021829 05:34:07 (05:19:53) 09-03-2010 16:55 09-03-2010 16:55 Reported: Ok
World Community Grid 6.03 hpf2 ne820_00038 06:26:45 (06:11:06) 09-03-2010 13:07 09-03-2010 13:07 Reported: Ok
World Community Grid 6.03 hpf2 ne852_00027 00:01:11 (00:01:02) 09-03-2010 12:35 09-03-2010 12:36 Reported: Computation error (1,)
World Community Grid 6.10 hfcc HFCC_s2_00419591_s2_0001 09:32:57 (09:20:57) 09-03-2010 12:33 09-03-2010 12:34 Reported: Ok
World Community Grid 6.10 hfcc HFCC_s2_00418320_s2_0001 10:22:15 (10:10:21) 09-03-2010 11:53 09-03-2010 11:54 Reported: Ok
World Community Grid 6.03 Human Proteome Folding - Phase 2 ne853_00092 00:50:50 (00:48:20) 09-03-2010 10:34 09-03-2010 10:36 Reported: Computation error (1,)
World Community Grid 6.03 Human Proteome Folding - Phase 2 ne825_00040 00:01:26 (00:01:12) 09-03-2010 09:38 09-03-2010 09:39 Reported: Computation error (1,)
World Community Grid 6.06 Help Conquer Cancer X0000084600343200703161822 05:08:41 (05:00:26) 09-03-2010 09:37 09-03-2010 09:37 Reported: Ok
World Community Grid 6.03 Human Proteome Folding - Phase 2 ne820_00036 00:01:19 (00:01:15) 09-03-2010 05:53 09-03-2010 05:54 Reported: Computation error (1,)
World Community Grid 6.10 Help Fight Childhood Cancer HFCC_s2_00418006_s2_0001 07:54:03 (07:50:56) 09-03-2010 05:52 09-03-2010 05:52 Reported: Ok
World Community Grid 6.03 Human Proteome Folding - Phase 2 ne816_00007 05:46:25 (05:42:25) 09-03-2010 04:28 09-03-2010 04:28 Reported: Ok

To emphasize, when no AutoDock jobs ran concurrently, there was a 100% hpf2 success rate, even with the periodic preemptive scheduling-in of a 300-hour CPDN model.
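For anyone who wants to check their own BOINCTasks history for the same pattern, here is a rough Python sketch. It assumes the line layout of the samples above (project, app version, app name, task name, elapsed time, CPU time in brackets, completed time, reported time, status) and treats a task's run window as "completed minus elapsed", which ignores preemption. The function names and the app-name sets are made up for illustration; they are not part of BOINCTasks.

```python
import re
from datetime import datetime, timedelta

# One history line, layout assumed from the pasted samples.
LINE_RE = re.compile(
    r"^World Community Grid\s+(?P<version>\d+\.\d+)\s+(?P<app>.+?)\s+(?P<task>\S+)\s+"
    r"(?P<elapsed>\d{2}:\d{2}:\d{2})\s+\((?P<cpu>\d{2}:\d{2}:\d{2})\)\s+"
    r"(?P<completed>\d{2}-\d{2}-\d{4} \d{2}:\d{2})\s+"
    r"(?P<reported>\d{2}-\d{2}-\d{4} \d{2}:\d{2})\s+"
    r"Reported: (?P<status>.+)$"
)

# Assumption: these labels are how the apps appear in the history.
HPF2_NAMES = {"hpf2", "Human Proteome Folding - Phase 2"}
AUTODOCK_NAMES = {"faah", "hfcc", "Help Fight Childhood Cancer"}


def _to_delta(hms):
    """Convert 'HH:MM:SS' to a timedelta."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def parse_line(line):
    """Return a dict for one history line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    end = datetime.strptime(match["completed"], "%d-%m-%Y %H:%M")
    return {
        "app": match["app"],
        "task": match["task"],
        # Approximate start: wall-clock end minus elapsed time.
        "start": end - _to_delta(match["elapsed"]),
        "end": end,
        "ok": match["status"] == "Ok",
    }


def failed_hpf2_with_autodock(lines):
    """Map each failed HPF2 task to the AutoDock tasks whose window overlaps it."""
    records = [rec for rec in map(parse_line, lines) if rec is not None]
    autodock = [rec for rec in records if rec["app"] in AUTODOCK_NAMES]
    overlaps = {}
    for rec in records:
        if rec["app"] in HPF2_NAMES and not rec["ok"]:
            overlaps[rec["task"]] = [
                other["task"]
                for other in autodock
                if other["start"] < rec["end"] and rec["start"] < other["end"]
            ]
    return overlaps
```

Feed it the history lines as a list of strings; a failed HPF2 task that maps to an empty list would be a counter-example to the pattern described above.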

edit: italics on jobs of interest.

----------------------------------------
[Edit 1 times, last edit by Sekerob at Mar 17, 2010 10:49:08 AM]

[Mar 17, 2010 10:48:29 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

I will not be able to try out your success formula on my Win7-64 for another 8-9 hours but I will certainly be testing this tonight!

[Mar 17, 2010 12:38:02 PM]

Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline

Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Now that I have stopped HPF2, I wanted to take a more detailed look at the thousands of errors I got, and unfortunately it does not only fail within the first one or two minutes.
Below is a list of the errors with the highest crunch times before failing. I do not care about the loss of points, but I am certainly not happy about the loss of time.

ne580_ 00022_ 13-- Ceres Error 03.03.10 02:35:43 04.03.10 21:06:48 31.03 705.9 / 0.0
nf185_ 00033_ 5-- Uranus Error 12.03.10 01:22:10 13.03.10 23:49:13 30.49 681.8 / 0.0
ne998_ 00018_ 18-- Ceres Error 09.03.10 09:26:57 11.03.10 11:47:13 30.09 695.0 / 0.0
nf023_ 00025_ 11-- Pluto Error 09.03.10 16:50:42 11.03.10 12:41:16 29.80 695.0 / 0.0
ne867_ 00077_ 13-- Pluto Error 07.03.10 08:13:40 09.03.10 03:58:41 29.49 770.0 / 0.0
ne870_ 00038_ 14-- Ceres Error 07.03.10 10:27:16 09.03.10 03:58:16 29.26 658.5 / 0.0
ne684_ 00019_ 7-- Saturn Error 04.03.10 13:15:55 06.03.10 01:44:05 28.86 723.5 / 0.0
nf225_ 00030_ 10-- Ceres Error 12.03.10 17:22:58 14.03.10 10:36:41 28.57 641.8 / 0.0
ne956_ 00041_ 1-- Mercury Error 08.03.10 18:35:22 10.03.10 09:11:37 28.20 700.6 / 0.0
ne762_ 00044_ 1-- Saturn Error 05.03.10 14:15:05 07.03.10 11:24:50 26.73 670.0 / 0.0
nf049_ 00036_ 2-- Mars Error 10.03.10 01:28:50 10.03.10 23:17:40 5.00 120.7 / 0.0
nf086_ 00088_ 1-- Mars Error 10.03.10 13:14:57 11.03.10 11:31:40 4.32 101.3 / 0.0
ne859_ 00088_ 20-- Terra Error 07.03.10 09:20:54 08.03.10 05:26:53 3.40 79.4 / 0.0
ne845_ 00043_ 3-- Ceres Error 06.03.10 21:48:37 07.03.10 12:22:30 3.24 76.0 / 0.0
nf116_ 00029_ 4-- Mars Error 10.03.10 23:17:42 11.03.10 15:08:24 3.21 76.0 / 0.0
ne768_ 00051_ 4-- Pluto Error 05.03.10 16:41:01 06.03.10 04:47:06 2.29 56.3 / 0.0
ne963_ 00031_ 14-- Jupiter Error 08.03.10 20:47:25 09.03.10 23:52:13 2.26 55.4 / 0.0
ne851_ 00028_ 10-- Mars Error 07.03.10 00:23:02 07.03.10 13:03:27 2.10 50.6 / 0.0
nf030_ 00005_ 4-- Ceres Error 09.03.10 19:03:08 10.03.10 10:01:26 0.77 17.5 / 0.0

----------------------------------------

[Mar 17, 2010 2:02:38 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Hypernova,

I suggest you look at the log detail. The ones up to, say, 5 hours are all /401 fails or of the absent-output-file type **. Those at 27-31 hours are probably time-out loopers, cut off when they've computed something like 10x the fpops amount that was given in the task headers.

I'll drop a note in the back room to see if the lord of the wrench can do something about the time part.

ttyl

** I was collecting the different error messages from my own and all the wingmen's errors, then lost the list. There were about 6, of which at least 3 are surely device issues, such as "too many exits" and a time exceeded.
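To make the time-out arithmetic concrete, here is a minimal Python sketch of the rule of thumb above: a task is cut off once its elapsed time passes roughly 10x the estimated fpops divided by the host's speed. The 10x factor is only the "like 10x" figure from this post, the parameter name rsc_fpops_est follows the BOINC workunit field of that name, and the function names and the numbers in the usage note are hypothetical, not taken from any real HPF2 task header.

```python
def max_elapsed_seconds(rsc_fpops_est, host_flops, bound_factor=10.0):
    """Elapsed-time limit implied by a bound of bound_factor x the estimated fpops.

    rsc_fpops_est: estimated floating-point operations for the task.
    host_flops:    floating-point operations per second the host is credited with.
    bound_factor:  assumed ratio of the fpops bound to the estimate (10x per the post).
    """
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return bound_factor * rsc_fpops_est / host_flops


def is_probable_looper(elapsed_hours, rsc_fpops_est, host_flops, bound_factor=10.0):
    """True if a task ran long enough to have hit the assumed elapsed-time limit."""
    limit = max_elapsed_seconds(rsc_fpops_est, host_flops, bound_factor)
    return elapsed_hours * 3600.0 >= limit
```

With made-up numbers, a task estimated at 1.0e13 fpops on a host credited with 1.0e9 flops would get a limit of 100000 seconds (a bit under 28 hours), so a 30-hour failure would be flagged as a probable looper and a 5-hour failure would not.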

----------------------------------------

[Mar 17, 2010 2:21:33 PM]
