World Community Grid - View Thread - Insane short time to complete, tasks get aborted too early

World Community Grid Forums

Category: Support

Forum: GPU Support Forum

Thread: Insane short time to complete, tasks get aborted too early


adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2163
Status: Offline

Insane short time to complete, tasks get aborted too early

My iGPU normally takes about 0.5 to 1.5 hours to run an OPNG_0046xxx task. Right at this moment the time to complete an OPNG task is displayed as 00:03:21, that's 3 minutes and 21 seconds. That wouldn't be a problem unless there are way over 100 jobs in a single task.

I've already got 3 OPNG tasks on my iGPU that errored out -today- because it took too long to compute, according to BOINC.

(One task had 109 jobs inside and BOINC completed 101 of them in 100 minutes; then it got aborted while I was asleep. It was then reissued to an NVIDIA device, which took 7 minutes to complete it. Yeah, but mine is an iGPU; be patient, my dear BOINC!)
(The second task was also aborted after 100 minutes, with only 94 jobs inside and 92 jobs completed.)
(The third task was aborted, too, after 100 minutes, with 106 jobs inside and 102 jobs completed.)

Both tasks report: "exceeded elapsed time limit 6042.53 (943491.36G/156.14G)".

One of these three tasks was also handed to another wingman, who took 110 minutes to complete it; I'm pretty sure my device would have managed that, had it not been killed off by BOINC!

So what can I do about this? Simplified question: how can I spot this situation and prevent this from happening? (I could abort all tasks that have more than 90 jobs inside, that's easy to do.)

----------------------------------------
[Edit 4 times, last edit by adriverhoef at Jun 6, 2021 12:56:52 PM]

[Jun 6, 2021 10:05:34 AM]

Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1321
Status: Offline

Re: Insane short time to complete, tasks get aborted too early

Both tasks report: "exceeded elapsed time limit 6042.53 (943491.36G/156.14G)".

So what can I do about this? Simplified question: how can I spot this situation and prevent this from happening? (I could abort all tasks that have more than 90 jobs inside, that's easy to do.)

The fpops bound is 30 times the estimated fpops of your system.
Your system seems to report far too high a speed (flops), so BOINC thinks it can finish the job in time.
Make sure your system does not report more flops than it can really deliver.
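
The figures in that error message can be checked by hand. The sketch below assumes the limit is simply rsc_fpops_bound divided by the speed BOINC assumes for the device, which is how the message itself presents it (943491.36G/156.14G). The helper name seconds_for is made up for illustration.

```shell
# Sketch: reproduce the figures in
#   "exceeded elapsed time limit 6042.53 (943491.36G/156.14G)"
# Assumption: limit [s] = rsc_fpops_bound / flops.

# seconds_for FPOPS FLOPS: print FPOPS / FLOPS in seconds, one decimal.
seconds_for() {
    awk -v fpops="$1" -v flops="$2" 'BEGIN { printf "%.1f\n", fpops / flops }'
}

bound=943491362387280   # 943491.36G from the message, written out in full
flops=156.14e9          # 156.14G from the message (a rounded figure)

echo "time limit: $(seconds_for "$bound" "$flops") s"
```

With the rounded 156.14G this prints a limit of 6042.6 seconds, about 100 minutes. The 6042.53 in the message presumably comes from the unrounded speed.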

[Jun 6, 2021 2:11:04 PM]

adriverhoef

Re: Insane short time to complete, tasks get aborted too early

The fpops bound is 30 times the estimated fpops of your system.

That's what I've figured out, too, Crystal Pellet.

Your system seems to report far too high a speed (flops), so BOINC thinks it can finish the job in time.

Funnily enough, a few moments ago I received some more OPNG tasks, and now their estimated time is 2 minutes and 53 seconds, even less than before.

Make sure your system does not report more flops than it can really deliver.

I've found that all devices in the room have the same values for fpops in client_state.xml:

    <rsc_fpops_est>31449712079576.000000</rsc_fpops_est>
    <rsc_fpops_bound>943491362387280.000000</rsc_fpops_bound>

So what is your suggestion, Crystal Pellet, what should I do to avoid this situation?
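
The two values quoted above tie in with the figures reported earlier in the thread. As a rough check, assuming BOINC derives the displayed estimate as rsc_fpops_est divided by the device speed, and borrowing the rounded 156.14G from the error message as that speed:

```shell
# Sketch: relate the two client_state.xml values to each other and to the
# 00:03:21 estimate from the first post.
# Assumption: estimate [s] = rsc_fpops_est / flops.

est=31449712079576      # <rsc_fpops_est>
bound=943491362387280   # <rsc_fpops_bound>
flops=156.14e9          # rounded speed taken from the error message (assumed)

# Ratio of bound to estimate.
ratio=$(awk -v b="$bound" -v e="$est" 'BEGIN { printf "%.0f\n", b / e }')

# Estimated runtime as mm:ss, fractional seconds dropped.
estimate=$(awk -v e="$est" -v f="$flops" 'BEGIN {
    s = int(e / f)
    printf "%02d:%02d\n", int(s / 60), s % 60
}')

echo "bound/est = $ratio"
echo "estimate  = $estimate"
```

This prints a ratio of 30 and an estimate of 03:21, which agrees with both the "30 times" remark and the 00:03:21 shown by the client.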

[Jun 6, 2021 3:36:12 PM]

Crystal Pellet

Re: Insane short time to complete, tasks get aborted too early

    <rsc_fpops_est>31449712079576.000000</rsc_fpops_est>
    <rsc_fpops_bound>943491362387280.000000</rsc_fpops_bound>

So what is your suggestion, Crystal Pellet, what should I do to avoid this situation?

Those two values come from the project.
The value 156.14G is the denominator and comes from your system.
Somewhere in your client_state.xml you should be able to find that value written out in full. It should be something like: 167654000000.
I would not expect it to be p_fpops, because that is CPU-related.
If it were p_fpops, more systems with high-end CPUs and low-end cards (iGPUs) would have a similar problem.
In that case you could halve the value (while BOINC is not running) and not run BOINC's benchmarks any more.

edit: Found in client_state.xml, in the opng app_version part: <flops>5787115698.556988</flops>, but this value is adjusted every time.
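
To see which flops value each application version currently carries, something along these lines may help. It is only a sketch: the helper name list_app_flops is made up, and it assumes that each <app_version> block holds its <app_name> and <flops> on separate lines, which this thread does not confirm. The demo runs on a made-up fragment, not on a live client_state.xml.

```shell
# list_app_flops FILE: print "app_name flops" for each <app_version> block.
# Assumed layout (not confirmed in this thread): one <app_name>...</app_name>
# line and one <flops>...</flops> line inside each block.
list_app_flops() {
    awk '
        /<app_version>/ { inside = 1; name = ""; flops = "" }
        inside && /<app_name>/ {
            name = $0
            sub(/^.*<app_name>/, "", name)
            sub(/<\/app_name>.*$/, "", name)
        }
        inside && /<flops>/ {
            flops = $0
            sub(/^.*<flops>/, "", flops)
            sub(/<\/flops>.*$/, "", flops)
        }
        /<\/app_version>/ { if (inside) print name, flops; inside = 0 }
    ' "$1"
}

# Demo on a made-up fragment; the flops value is the one quoted above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<app_version>
    <app_name>opng</app_name>
    <flops>5787115698.556988</flops>
</app_version>
EOF
list_app_flops "$sample"
rm -f "$sample"
```

Running it before and after a new batch arrives would show whether the value has jumped.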

----------------------------------------
[Edit 2 times, last edit by Crystal Pellet at Jun 6, 2021 5:39:00 PM]

[Jun 6, 2021 4:09:55 PM]

PMH_UK
Veteran Cruncher
UK
Joined: Apr 26, 2007
Post Count: 771
Status: Offline

Re: Insane short time to complete, tasks get aborted too early

I noticed my points dropping recently and, on checking, found several tasks timing out.
All tasks appear to have the same values, so I built the command below to increase the bound.
No timeouts so far, so it appears to work; it needs to be run for each batch downloaded.
Wait for checkpoints, stop BOINC, run the command below, then restart.

sudo sed -i 's/<rsc_fpops_bound>943491362387280.000000<\/rsc_fpops_bound>/<rsc_fpops_bound>1943491362387280.000000<\/rsc_fpops_bound>/' client_state.xml
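
A slightly more cautious variant of the same substitution keeps a backup copy. This is a sketch only: raise_bound is a hypothetical helper, it relies on sed accepting a backup suffix after -i (as GNU sed does), and the demo works on a throwaway file instead of the real client_state.xml.

```shell
# raise_bound FILE OLD NEW: replace <rsc_fpops_bound>OLD</rsc_fpops_bound>
# with NEW throughout FILE, keeping the original as FILE.bak.
# OLD is used as a sed pattern, so a literal dot is written as \. below.
raise_bound() {
    sed -i.bak "s|<rsc_fpops_bound>$2</rsc_fpops_bound>|<rsc_fpops_bound>$3</rsc_fpops_bound>|" "$1"
}

# Dry run on a throwaway file rather than the live client_state.xml.
tmp=$(mktemp)
echo '    <rsc_fpops_bound>943491362387280.000000</rsc_fpops_bound>' > "$tmp"
raise_bound "$tmp" '943491362387280\.000000' '1943491362387280.000000'
cat "$tmp"
rm -f "$tmp" "$tmp.bak"
```

On the real file it would still need sudo, and BOINC should be stopped first, as described above.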

Paul.

----------------------------------------
[Edit 1 times, last edit by PMH_UK at Jun 6, 2021 4:58:11 PM]

[Jun 6, 2021 4:57:42 PM]

adriverhoef

Re: Insane short time to complete, tasks get aborted too early

Paul, thanks for your solution. Let's hope we will not need it anymore after today's experiences.

Maybe it's time to clarify a bit.

The aforementioned device with three timeout errors today has never seen "exceeded elapsed time limit" before with OPNG.

This same device downloads some OPNG tasks several times a day, and each time the completion times are recalculated for all queued OPNG tasks at once, apparently. (That's normal behaviour for all subprojects, e.g. MCM1 and ARP1.) So before any other OPNG tasks had even been started, I was looking at estimated completion times of about 3 minutes today, which is unusually short: this device usually sees estimated completion times of about one hour or more.

However, several hours after downloading tasks with that insane estimated completion time, another set of tasks was downloaded and the estimated completion times were restored then to a more sensible value of nearly three HOURS (instead of minutes), without my intervention.

[Jun 6, 2021 11:44:35 PM]

Crystal Pellet

Re: Insane short time to complete, tasks get aborted too early

After some OPNG tasks with an estimated runtime of 1 hour, 28 minutes and 8 seconds (AMD 7770),
I got a bunch of new tasks and the estimate dropped to 6 minutes and 40 seconds.
At the same time the flops value in the opng app_version part went up from
5946875132.406872 to
78492328901.667801
Half an hour later new tasks arrived and the estimated runtime jumped back to 1 hour, 28 minutes and 3 seconds,
with flops back at 5952525483.785850
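
These estimates line up with the flops values if the estimate is taken to be rsc_fpops_est divided by flops. The sketch below assumes that relation, and assumes the rsc_fpops_est of 31449712079576 quoted earlier in the thread also applied to these tasks. runtime_hms is a made-up helper name.

```shell
# runtime_hms FPOPS_EST FLOPS: print FPOPS_EST / FLOPS as h:mm:ss
# (fractional seconds dropped).
runtime_hms() {
    awk -v e="$1" -v f="$2" 'BEGIN {
        s = int(e / f)
        printf "%d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
    }'
}

est=31449712079576   # rsc_fpops_est quoted earlier in the thread (assumed here)

runtime_hms "$est" 5946875132.406872    # 1:28:08
runtime_hms "$est" 78492328901.667801   # 0:06:40
runtime_hms "$est" 5952525483.785850    # 1:28:03
```

All three results match the runtimes reported in this post, so the jump in flops on its own appears to account for the short estimate.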

----------------------------------------
[Edit 1 times, last edit by Crystal Pellet at Jun 7, 2021 7:30:08 AM]

[Jun 7, 2021 7:20:38 AM]

adriverhoef

Re: Insane short time to complete, tasks get aborted too early

After one day with reasonable estimated runtimes (2½ hours), I saw to my dismay when I woke up this morning that the estimated runtime had dropped to 2 minutes and 59 seconds again. A sizeable task (106 jobs inside) was running, so I wondered whether it would finish before the 100-minute boundary. To help it along, I stopped the BOINC client, edited client_state.xml to adjust the value in <rsc_fpops_bound>, and restarted the BOINC client. Well, no change: all the estimated runtimes were still at 2 minutes and 59 seconds. In the end the task finished just in time, in 1 hour and 38 minutes, which is just below 100 minutes.

I also noticed that one oversized task (with 146 jobs inside) had already finished before I woke up, in about two hours and eleven minutes(!), which must have been with the same estimated runtime of 2 minutes and 59 seconds (see the proof below). It is still Pending Validation, just like the one with 106 jobs inside that finished after it.

Result Name           Status              Sent Time        Due / Return Time  CPUh/Spent  Claimed/Granted
OPNG_0049963_00084_1  Pending Validation  6/9/21 03:23:02  6/10/21 09:08:13   0.09/1.63   1.6/0.0
OPNG_0049963_00062_1  Pending Validation  6/9/21 03:23:02  6/10/21 07:22:22   0.11/2.18   2.1/0.0

[Copied from Results Status, generated by wcgformat]

Proof from job_log_www.worldcommunitygrid.org.txt:

1623309730 ue 179.734730 ct 411.502700 fe 31449712079576 nm OPNG_0049963_00062_1 et 7855.492005 es 0
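
The fields of that log line can be pulled apart with a little awk. In this sketch the meaning of the keys is inferred from the numbers in this thread and is not taken from BOINC documentation: ue looks like the estimated runtime and et like the elapsed time, both in seconds. job_log_field is a hypothetical helper.

```shell
# job_log_field LINE KEY: print the value that follows KEY in a job log line.
# The line is treated as a timestamp followed by "key value" pairs.
job_log_field() {
    printf '%s\n' "$1" | awk -v key="$2" '{
        for (i = 2; i < NF; i += 2)
            if ($i == key) { print $(i + 1); exit }
    }'
}

line='1623309730 ue 179.734730 ct 411.502700 fe 31449712079576 nm OPNG_0049963_00062_1 et 7855.492005 es 0'

ue=$(job_log_field "$line" ue)   # apparently the estimated runtime [s]
et=$(job_log_field "$line" et)   # apparently the elapsed time [s]

awk -v ue="$ue" -v et="$et" 'BEGIN {
    printf "estimate %d min %d s, elapsed %.2f h\n", int(ue / 60), int(ue % 60), et / 3600
}'
```

This prints an estimate of 2 min 59 s and an elapsed time of 2.18 h, the same figures as in the estimated runtime mentioned above and in the results table.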

So now they don't get aborted too early anymore (after 100 minutes)?

Anyone else with their experiences in this matter?

----------------------------------------
[Edit 2 times, last edit by adriverhoef at Jun 12, 2021 9:03:51 PM]

[Jun 10, 2021 10:52:48 AM]

PMH_UK

Re: Insane short time to complete, tasks get aborted too early

Increasing rsc_fpops_est would change the estimated time.
rsc_fpops_bound controls the limit.

I have had only one failure, where I had increased the bound and the task still exceeded it.

Paul.

[Jun 10, 2021 11:14:42 AM]

adriverhoef

Re: Insane short time to complete, tasks get aborted too early

Increasing rsc_fpops_est would change the estimated time.
rsc_fpops_bound controls the limit.

That sounds more logical, too, Paul.

A little script should do the job (YMMV):


# Define variables:
BOINCDIR=~boinc; DIR=/var/lib/boinc; [ -d "$DIR" ] && BOINCDIR=$DIR
CLIENT_STATE=$BOINCDIR/client_state.xml

# Stop BOINC:
sudo systemctl stop boinc-client
# Double rsc_fpops_est for OPNG tasks, which raises their estimated runtime:
sudo perl -w -i -p -e '
  # $b = opening tag, $v = value, $e = decimals plus closing tag
  if (($b,$v,$e) = /(<rsc_fpops_est>)(31449712079576)(\.000000<\/rsc_fpops_est>)/) {
    $v = 2 * $v;
    s//$b$v$e/;   # empty pattern: reuse the last successful match
  }
' "$CLIENT_STATE"
# Restart BOINC:
sudo systemctl start boinc-client

[Jun 10, 2021 11:28:59 PM]
