World Community Grid Forums
Thread Status: Active Total posts in this thread: 90
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
And some units are ending up really badly. All 5 copies exited in Job #3 with RC = 0x1 or 0x100 but all resulted in Error (or Too Late).

| Result Name | OS | App Version | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit |
|---|---|---|---|---|---|---|---|
| BETA_E236439_34_S.430.C54H26S5.PPZCAYPFBDNNFH-UHFFFAOYSA-N.15_s1_14_4-- | Microsoft Windows 10 x64 Edition, (10.00.10586.00) | 700 | Too Late | 18/03/16 01:58:24 | 18/03/16 10:38:19 | 8.31 | 263.8 / 0.0 |
| BETA_E236439_34_S.430.C54H26S5.PPZCAYPFBDNNFH-UHFFFAOYSA-N.15_s1_14_2-- | Linux 3.13.0-32-generic | 700 | Error | 17/03/16 10:22:48 | 18/03/16 01:58:16 | 14.90 | 251.5 / 0.0 |
| BETA_E236439_34_S.430.C54H26S5.PPZCAYPFBDNNFH-UHFFFAOYSA-N.15_s1_14_3-- | Linux 4.5.0-rc5 | 700 | Error | 17/03/16 10:22:42 | 18/03/16 00:28:11 | 13.45 | 268.0 / 0.0 |
| BETA_E236439_34_S.430.C54H26S5.PPZCAYPFBDNNFH-UHFFFAOYSA-N.15_s1_14_1-- | Microsoft x64 Edition, (10.00.10586.00) | 700 | Error | 16/03/16 23:08:10 | 17/03/16 10:09:58 | 9.16 | 266.0 / 0.0 |
| BETA_E236439_34_S.430.C54H26S5.PPZCAYPFBDNNFH-UHFFFAOYSA-N.15_s1_14_0-- | Microsoft Windows 8.1 Enterprise x64 Edition, (06.03.9600.00) | 700 | Error | 16/03/16 23:06:30 | 17/03/16 10:22:33 | 5.31 | 148.9 / 0.0 |

minus56bits
Cruncher Joined: Jan 14, 2016 Post Count: 3 Status: Offline
BETA_E236440_732_S.454.C52H20N6O2S4.UVQUMKKFZKZIPZ-UHFFFAOYS was looping just below 1% (that is about 11 minutes on my machine) for more than 2 hours. Estimated completion time was 50:26 hours.
STDERR.TXT showed several messages like "missing heartbeat". Unfortunately I can't post the file, as I killed it somehow while playing around, and after that the task went to "Computing Error". :-( Sorry. I think the error message is misleading, as 7 UGM tasks were running (and are still running) in parallel without any issues. The client is version 7.2.47 on Windows 10 Pro on an Intel i7-6700K.
--------- Frank
[Edit 1 times, last edit by minus56bits at Mar 18, 2016 12:27:04 PM]

SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
Has your system done CEP2 before? [This is the same app in Beta as used in production, just tasks with different configurations.] The reason I'm asking is that the missing heartbeat error hints at performance issues [the system is too heavily loaded to let the task keep in touch with the client and tell it that it's still running... a 30-second interruption will reset the task or kill it].
Edit: To add, the application goes into a setup phase which can last quite a while, and is so heavy on disk I/O that actual computation does not start until that is finished. Occasionally I see 5-7 minutes pass, depending on overall system load, where only Elapsed is clocking time; then, when calculation starts, the CPU time counter also begins ticking. This is not easy to monitor in the official BOINC Manager, but if you select a starting CEP2 task and hit the Properties button on the left [only in the BOINC Manager advanced view], you'll see many more task progress details.
[Edit 2 times, last edit by SekeRob* at Mar 18, 2016 1:05:09 PM]
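For anyone wondering how a heartbeat timeout of the kind described above works in principle, here is a minimal Python sketch. It is not BOINC's actual code: the class `HeartbeatWatchdog`, its method names, and the injectable `clock` are hypothetical, and the 30-second figure is simply the one quoted in the post.

```python
import time

# Timeout taken from the post above; treated here as an assumption.
HEARTBEAT_TIMEOUT_S = 30.0


class HeartbeatWatchdog:
    """Hypothetical helper: remembers the last heartbeat and reports a timeout."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested without sleeping
        self._last_beat = clock()    # start counting from creation time

    def beat(self):
        """Record that a heartbeat was just received."""
        self._last_beat = self._clock()

    def seconds_since_beat(self):
        """Time elapsed since the most recent heartbeat."""
        return self._clock() - self._last_beat

    def timed_out(self):
        """True once more than timeout_s has passed without a heartbeat."""
        return self.seconds_since_beat() > self._timeout_s


# Usage with a fake clock: a 29 s gap is tolerated, a gap over 30 s is not.
fake_now = [0.0]
watchdog = HeartbeatWatchdog(clock=lambda: fake_now[0])
fake_now[0] = 29.0
print(watchdog.timed_out())
watchdog.beat()
fake_now[0] = 60.0
print(watchdog.timed_out())
```

The point of the sketch is only that a heavily loaded system can delay the heartbeat long enough to trip the timeout even though the task itself is healthy.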

KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1672 Status: Offline
So happens to have asked for a feature that makes CEP2 never show a TTC higher than the cap of 18 hours. Your 1.5 days deadline is plenty of time with the knowledge we have, but the client currently can't be made to wise up on that, at least AFAIK.
Finally, WU BETA_E236439_388_S.420.C44F2H16N6S5.SIQVRHDHDDANTL-UHFFFAOYSA-N.10_s1_14 has been sent back this morning after less than 6 computation hours but is already "too late".
Yves

SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
Are you sure it is a genuine 'too late', or just marked 'too late' as an escape clause because the 5 copies could not get a quorum together?

KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1672 Status: Offline
I am the only one marked "too late", the 4 other wingmen are stated as "error".

SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
@knreed, can't you find a different, additional moniker for these results that fail to converge on a quorum [but have not failed]? To reuse an old status which was abolished in favor of Pending Verification because of confusion, call these "Final inconclusive" or maybe "Unverifiable" :O).

SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
> I am the only one marked "too late", the 4 other wingmen are stated as "error".

Exactly what I said: your 5th could not find a wingman because they all errored out, so the 5th gets the misleading 'Too Late' [I think it's explained in the Community maintained FAQ's]. Also see my previous post directed at knreed.

SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline
As for the 'Community maintained FAQ's', they are -not- being maintained: I have a 'watch' on that forum and have yet to get a single mail saying there was any action in there [only knreed added an OP before the rename].

Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
On Linux, Beta WUs seem to be running at higher priority than the other work units. OET units are being CPU starved. Other WUs on the same machine as the BETAs are running at 75% to 80% CPU utilization. Suspend the BETAs and the other WUs climb back to 99% to 100%. This never happened with the previous testing.
This has been resolved as a BOINC user error. I forgot that ncpus had been changed, and as a result there were 24 processes running on a 16-CPU machine, since BOINC had been told there were more processors. Once the number of running WUs matched the number of processors, the BETA WUs ran as expected.
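As a rough illustration of why oversubscribing the CPUs lowers per-task utilization, here is a small Python sketch. It assumes every running task is fully CPU-bound and that the scheduler shares the processors evenly, which is a simplification; the function name `cpu_share_per_task` is made up for this example.

```python
def cpu_share_per_task(running_tasks, physical_cpus):
    """Average fraction of one CPU each task gets under an even split.

    Simplifying assumption: all tasks are CPU-bound and scheduled fairly.
    """
    if running_tasks <= 0 or physical_cpus <= 0:
        raise ValueError("task and CPU counts must be positive")
    # A task can never use more than one full CPU in this model.
    return min(1.0, physical_cpus / running_tasks)


# 24 tasks on 16 CPUs (the situation described above) versus a matched count.
print(f"{cpu_share_per_task(24, 16):.0%}")
print(f"{cpu_share_per_task(16, 16):.0%}")
```

Under these assumptions the even split comes out at about two thirds of a CPU per task. The 75% to 80% reported above is somewhat higher, which would be consistent with some tasks not being CPU-bound all the time, though that is a guess rather than something measured here.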