World Community Grid Forums
Thread Status: Active Total posts in this thread: 18
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2163
It normally takes my iGPU about 0.5 to 1.5 hours to run the OPNG_0046xxx tasks. Right now the estimated time to complete an OPNG task is displayed as 00:03:21, that's 3 minutes and 21 seconds. That wouldn't be a problem, except that some tasks contain well over 100 jobs.
I've already had 3 OPNG tasks on my iGPU error out -today- because, according to BOINC, they took too long to compute.
(One task had 109 jobs inside and BOINC completed 101 of them in 100 minutes; then it got aborted while I was asleep. It was reissued to an NVIDIA device, which took 7 minutes to complete it. Yeah, but mine is an iGPU; be patient, my dear BOINC!)
(The second task was also aborted after 100 minutes, with only 94 jobs inside and 92 jobs completed.)
(The third task was aborted, too, after 100 minutes, with 106 jobs inside and 102 jobs completed.)
Both tasks report: "exceeded elapsed time limit 6042.53 (943491.36G/156.14G)".
One of these three tasks was also handed over to another wingman, who took 110 minutes to complete it; I'm pretty sure my device would have managed that, had it not been killed off by BOINC!
So what can I do about this? Simplified question: how can I spot this situation and prevent it from happening? (I could abort all tasks that have more than 90 jobs inside, that's easy to do.)
[Edit 4 times, last edit by adriverhoef at Jun 6, 2021 12:56:52 PM]
----------------------------------------
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1321
[quote]Both tasks report: "exceeded elapsed time limit 6042.53 (943491.36G/156.14G)". So what can I do about this? Simplified question: how can I spot this situation and prevent this from happening? (I could abort all tasks that have more than 90 jobs inside, that's easy to do.)[/quote]
The fpops bound is 30 times the estimated fpops of the task. Your system seems to report a flops value that is much too high, so BOINC thinks it can process the task in time. Make sure your system does not report a higher flops value than it can really deliver.
----------------------------------------
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2163
[quote]The fpops bound is 30 times the estimated fpops of your system. Your system seems to report a much too high fpops and so BOINC thinks it can process the job in time.[/quote]
It's 'funny': a few moments ago I received some more OPNG tasks, and now their estimated time is 2 minutes and 53 seconds, even less than before.
[quote]Avoid that your system reports higher fpops than it really can.[/quote]
I've found that all devices in the room have the same values for fpops in client_state.xml:
<rsc_fpops_est>31449712079576.000000</rsc_fpops_est>
So what is your suggestion, Crystal Pellet? What should I do to avoid this situation?
----------------------------------------
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1321
[quote]<rsc_fpops_est>31449712079576.000000</rsc_fpops_est> So what is your suggestion, Crystal Pellet, what should I do to avoid this situation?[/quote]
Those 2 values come from the project. The value 156.14G is the denominator and comes from your system. Somewhere in your client_state.xml you should be able to find that value written out in full. It should be something like 156140000000 (G in the message means 10^9).
I would not expect it to be p_fpops, because that's CPU-related. If it were p_fpops, more systems with high-end CPUs and low-end cards (iGPU) would have a similar problem. If that is the case, you could halve the value (while BOINC is not running) and not run BOINC's benchmark anymore.
edit: Found it in client_state.xml, in the opng app_version part: <flops>5787115698.556988</flops>, but this value is adjusted every time.
[Edit 2 times, last edit by Crystal Pellet at Jun 6, 2021 5:39:00 PM]
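For readers who want to check the arithmetic behind the error message, here is a minimal Python sketch. It assumes that the estimated runtime is rsc_fpops_est divided by the flops value the client uses, that the elapsed time limit is rsc_fpops_bound divided by that same value, and that "G" in the message means 10^9. The bound is the figure that appears in the sed command elsewhere in this thread; the function names are made up for illustration and are not part of BOINC.

```python
# Values quoted in this thread. The two task-side values come from the project.
RSC_FPOPS_EST = 31449712079576.0     # <rsc_fpops_est> from client_state.xml
RSC_FPOPS_BOUND = 943491362387280.0  # <rsc_fpops_bound>, shown as 943491.36G
REPORTED_FLOPS = 156.14e9            # the "156.14G" denominator (assumes G = 1e9)


def estimated_runtime(fpops_est: float, flops: float) -> float:
    """Estimated runtime in seconds (assumption: est / flops)."""
    return fpops_est / flops


def elapsed_limit(fpops_bound: float, flops: float) -> float:
    """Elapsed time limit in seconds (assumption: bound / flops)."""
    return fpops_bound / flops


def hms(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS, dropping the fraction."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


if __name__ == "__main__":
    print("bound / est :", RSC_FPOPS_BOUND / RSC_FPOPS_EST)
    print("estimate    :", hms(estimated_runtime(RSC_FPOPS_EST, REPORTED_FLOPS)))
    print("time limit  :", hms(elapsed_limit(RSC_FPOPS_BOUND, REPORTED_FLOPS)))
```

Under these assumptions the estimate comes out at the 00:03:21 displayed in the first post, and the limit lands at roughly 100 minutes, which is when the tasks were aborted. A flops value that is too high therefore shortens both the estimate and the limit by the same factor.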
----------------------------------------
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 771
I noticed my points dropping recently and, upon checking, found several tasks timing out.
All tasks appear to have the same values, so I built the command below to increase the bound. No timeouts so far, so it appears to work, but it needs to be run for each batch downloaded.
Wait for checkpoints, stop BOINC, run the command below, restart.
sudo sed -i 's/<rsc_fpops_bound>943491362387280.000000<\/rsc_fpops_bound>/<rsc_fpops_bound>1943491362387280.000000<\/rsc_fpops_bound>/' client_state.xml
Paul.
[Edit 1 times, last edit by PMH_UK at Jun 6, 2021 4:58:11 PM]
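The sed command above only matches one exact bound value. As a hedged alternative, the Python sketch below scales whatever value sits inside each <rsc_fpops_bound> element. It is not Paul's method, the function name and the default factor are invented for illustration, and it works on the file's text only, so the same precautions apply (wait for checkpoints, stop BOINC first, keep a backup of client_state.xml).

```python
import re

# Matches <rsc_fpops_bound>NUMBER</rsc_fpops_bound> with any plain numeric value.
BOUND_RE = re.compile(r"(<rsc_fpops_bound>)([0-9.eE+]+)(</rsc_fpops_bound>)")


def raise_fpops_bound(xml_text: str, factor: float = 2.0) -> str:
    """Return xml_text with every rsc_fpops_bound value multiplied by factor."""

    def _scale(match: re.Match) -> str:
        new_value = float(match.group(2)) * factor
        # Keep the six-decimal format seen in the thread's client_state.xml excerpts.
        return f"{match.group(1)}{new_value:.6f}{match.group(3)}"

    return BOUND_RE.sub(_scale, xml_text)


if __name__ == "__main__":
    sample = (
        "<rsc_fpops_est>31449712079576.000000</rsc_fpops_est>\n"
        "<rsc_fpops_bound>943491362387280.000000</rsc_fpops_bound>\n"
    )
    print(raise_fpops_bound(sample, factor=2.0))
```

Only the bound is touched; rsc_fpops_est is left alone, so the displayed estimate would not be expected to change.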
----------------------------------------
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2163
Paul, thanks for your solution. Let's hope we will not need it anymore after today's experiences.
Maybe it's time to clarify a bit. The aforementioned device with three timeout errors today has never seen "exceeded elapsed time limit" before with OPNG. This same device downloads some OPNG tasks several times a day, and each time the estimated completion times are recalculated, apparently for all queued OPNG tasks at once. (That's normal behaviour for all subprojects, e.g. MCM1 and ARP1.)
So before any other OPNG tasks had even been started, I was looking at estimated completion times of about 3 minutes today, which is unusually short: this device usually shows estimated completion times of about one hour or more.
However, several hours after downloading tasks with that insane estimated completion time, another set of tasks was downloaded and the estimated completion times were restored to a more sensible value of nearly three HOURS (instead of minutes), without my intervention.
----------------------------------------
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1321
After some OPNGs with an estimated runtime of 1 hour, 28 minutes and 8 seconds (AMD 7770), I got a bunch of tasks and the estimated runtime jumped to 6 minutes and 40 seconds.
At the same time the flops in the app_version part of opng went up from 5946875132.406872 to 78492328901.667801.
Half an hour later new tasks arrived and the estimated runtime jumped back to 1 hour, 28 minutes and 3 seconds, with flops back to 5952525483.785850.
[Edit 1 times, last edit by Crystal Pellet at Jun 7, 2021 7:30:08 AM]
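One way to spot this situation is to watch the app_version flops value for sudden jumps. The Python sketch below is only an illustration: it assumes the rsc_fpops_est of 31449712079576 quoted earlier in the thread also applies to these tasks, that the estimate is rsc_fpops_est divided by flops, and that a jump of more than three times the previous reading is worth flagging. The threshold and the function names are arbitrary choices, not anything BOINC defines.

```python
RSC_FPOPS_EST = 31449712079576.0  # quoted earlier in the thread (assumed to apply here)

# The three flops readings reported in the post above, in order.
FLOPS_READINGS = [5946875132.406872, 78492328901.667801, 5952525483.785850]


def estimate_seconds(flops: float, fpops_est: float = RSC_FPOPS_EST) -> float:
    """Estimated runtime in seconds (assumption: fpops_est / flops)."""
    return fpops_est / flops


def flag_spikes(readings: list, max_ratio: float = 3.0) -> list:
    """Return the indices of readings more than max_ratio times the previous one."""
    flagged = []
    for i in range(1, len(readings)):
        if readings[i] > max_ratio * readings[i - 1]:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    for flops in FLOPS_READINGS:
        seconds = int(estimate_seconds(flops))
        print(f"flops {flops:.0f} -> about {seconds // 60} min {seconds % 60} s")
    print("suspicious readings at index:", flag_spikes(FLOPS_READINGS))
```

With these inputs the three estimates come out within a second of the runtimes reported in the post, which supports the assumed formula, and only the middle reading is flagged.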
----------------------------------------
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2163
After one day with reasonable estimated runtimes (2½ hours), I saw to my dismay when I woke up this morning that the estimated runtime had dropped to 2 minutes and 59 seconds again. A sizeable task (106 jobs inside) was running, so I wondered whether it would finish in time, before the 100-minute boundary. To help a bit more, I stopped the BOINC client, edited the client_state.xml file to adjust the value in <rsc_fpops_bound>, then restarted the BOINC client. Well, no change: all the estimated runtimes were still at 2 minutes and 59 seconds. In the end it finished just in time: 1 hour and 38 minutes, which is just below 100 minutes.
I noticed that one oversized task (with 146 jobs inside) had already finished before I woke up, in two hours and twelve minutes(!), which must have been with the same estimated runtime of 2 minutes and 59 seconds! (*see proof below) It is still Pending Validation, just like the one with 106 jobs inside that finished afterwards.
Result Name Status Sent Time Due / Return Time CPUh/Spent Claimed/Granted
[Copied from Results Status, generated by wcgformat]
1623309730 ue 179.734730 ct 411.502700 fe 31449712079576 nm OPNG_0049963_00062_1 et 7855.492005 es 0
So now they don't get aborted too early anymore (after 100 minutes)?
[Edit 2 times, last edit by adriverhoef at Jun 12, 2021 9:03:51 PM]
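The "proof" line above looks like a timestamp followed by key/value pairs, so it can be split mechanically. The Python sketch below does that. How the two-letter keys are read is an assumption on my part (ue as the estimated runtime, et as the elapsed time, fe as the fpops estimate, nm as the task name); the post itself does not spell them out, and the function name is hypothetical.

```python
# The record quoted in the post above.
LINE = (
    "1623309730 ue 179.734730 ct 411.502700 fe 31449712079576 "
    "nm OPNG_0049963_00062_1 et 7855.492005 es 0"
)


def parse_record(line: str) -> dict:
    """Split a 'timestamp key value key value ...' record into a dict of strings."""
    fields = line.split()
    record = {"timestamp": int(fields[0])}
    # After the timestamp, the fields alternate between key and value.
    for key, value in zip(fields[1::2], fields[2::2]):
        record[key] = value
    return record


if __name__ == "__main__":
    rec = parse_record(LINE)
    estimate = float(rec["ue"])  # assumed: estimated runtime in seconds
    elapsed = float(rec["et"])   # assumed: elapsed runtime in seconds
    print(rec["nm"])
    print(f"estimate {estimate:.0f} s, elapsed {elapsed:.0f} s, "
          f"ratio {elapsed / estimate:.1f}")
```

Read that way, the record shows an estimate of just under three minutes next to an elapsed time of more than two hours, which fits the point being made in the post.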
----------------------------------------
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 771
Increasing rsc_fpops_est would change the estimated time.
rsc_fpops_bound controls the limit. I have had only 1 failure where I had increased the bound and the task still exceeded it.
Paul.
----------------------------------------
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2163
[quote]Increasing rsc_fpops_est would change the estimated time.[/quote]
That would sound more logical, too, Paul.
[quote]rsc_fpops_bound controls the limit.[/quote]
A little script should do the job (YMMV):