World Community Grid Forums
Thread Status: Active Total posts in this thread: 22
Author |
|
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 771 Status: Recently Active |
In my experience some versions of BOINC can go into panic mode with a high "switch between" value compared to the cache size.
For me the fix was to reduce the "switch between" value and update. The new value takes effect immediately and, if it is low enough, BOINC stops panicking and lets the active tasks run to completion, followed by any waiting to run.
Paul.
|
||
|
joeperry39@gmail.com
Advanced Cruncher USA Joined: Nov 22, 2006 Post Count: 140 Status: Offline |
SekeRob said: System date right is important. If that work was fetched with a wrong system date and it then moved forward it could also have kicked this off [the inflated switch time won't have helped the situation]. Abort all unstarted jobs, reduce cache to 1 day and let it run a few days. If it does not right itself, check in again.
--//--
Prior to KeithHenry posting a suggested solution, I did abort all of the jobs (109 of them, all due on the 15th except 1 for the 16th) and am forcing the previously started jobs to clear out. As soon as they are completed, I'll let everything run normally.
I do believe the "wrong date & time" may well have been the problem. I remember that, at some point, both of those were incorrect, and the date was several days prior to the then-current date. When I noticed that problem I immediately corrected the date & time. The problem is that I don't remember whether BOINC was "up-and-running" at that point.
At any rate, I'll continue with the solution suggested by SekeRob and see what happens. The "started but not completed and then restarted jobs" should clear out later today and everything will then, hopefully, be back to normal.
Thanks to all who replied with suggestions for correcting this problem. I'll report back after a few days on the then-current status.
"Everything in moderation, including moderation" -- Mark Twain
[Edit 1 times, last edit by osugrad at Dec 7, 2011 4:28:21 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Exactly. The 6000 minute switch was designed to let Beta/Repairs jump the queue automatically; 18000 minutes (12.5 days) will definitely make the panic persist until the cache is completely empty. And when there's an all-core panic, some versions will stop fetching work until a core goes idle. Aborting is easiest (manure happens), and it does not risk the tasks at the end of the line going overdue anyhow. The wingmen will appreciate it. The daily quota is big enough not to cut out anyone who has the incidental mishap, and aborting does not eat into the reliability rating needed to get repairs, which of course is nowhere land with a cache of 3 days or bigger.
--//-- edit: This was a follow-on comment to PMH_UK's post. I had not seen osugrad's reply till now. [Edit 1 times, last edit by Former Member at Dec 8, 2011 12:14:52 PM] |
||
|
depriens
Senior Cruncher The Netherlands Joined: Jul 29, 2005 Post Count: 350 Status: Offline |
It sometimes happens (out-of-the-blue) to some of my machines as well, particularly to those with a larger workunit cache. BOINC then pauses the units running and starts running other units with a later deadline. It then runs them for an hour and then pauses them to start some other units. This will go on until eventually all units are finished.
_IF_ I notice it, I manually pause all units not started yet until most halfway units are finished. Then I unpause all units. The workfetch resumes, BOINC finishes the units it's working on at that moment and then continues in the correct sequence. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
it only happens to my i7 when i set the cache to 1.20 or higher
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Makes no sense in the least, unless running with keithhenry's 18000 minute switch or similar. Why BOINC does not have the lucidity to reason: "Hey, the total TTC of running and ready-to-start tasks is less than any deadline of a task in cache, so just continue as normal"... is beyond me. Ingleside may have the answer as to why BOINC acts irrationally on that.
For sure, it's possible that an event such as a stuck WU, restarted and completed with huge elapsed but normal CPU time, kicks the inflation off, but however big the inflation, a 1.2 day cache is still a 1.2 day cache. Work fetch suspends until it sinks below that value. Do check in the task properties how Elapsed and CPU time relate. Has the lost CPU time devil returned for some? It's one of the reasons why I barely touch the standard BM and only use BOINCTasks. This tool shows both values side by side, so it's easy to follow if there is a stalling/performance issue on the CPU time side, and certainly great for following each task's efficiency. (I'm still presuming all these comments and observations are from known stable BOINC versions and not any in-between alpha/beta.) --//-- Edit: a comma [Edit 1 times, last edit by Former Member at Dec 8, 2011 12:30:07 PM] |
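For anyone wanting to compare Elapsed and CPU time across several tasks, here is a minimal Python sketch of that check. It assumes you have copied the two values (in seconds) from the task properties by hand; the function names and the 0.5 threshold are made up for illustration and are not anything BOINC or BOINCTasks defines.

```python
def cpu_efficiency(cpu_seconds, elapsed_seconds):
    """Fraction of wall-clock (elapsed) time the task actually spent on the CPU."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_seconds / elapsed_seconds


def looks_stalled(cpu_seconds, elapsed_seconds, threshold=0.5):
    """Flag a task whose CPU time lags far behind its elapsed time.

    The threshold is an arbitrary illustrative cut-off, not a BOINC setting.
    """
    return cpu_efficiency(cpu_seconds, elapsed_seconds) < threshold


if __name__ == "__main__":
    # Hypothetical values: 1 hour of CPU time over 10 hours elapsed.
    print(looks_stalled(3600, 36000))
    # Hypothetical values: CPU time nearly equal to elapsed time.
    print(looks_stalled(7000, 7200))
```

A task with huge elapsed but normal CPU time, as in the stuck-WU case described above, would show up with a low ratio here.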
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
boinc 6.10.60, elapse time usually 1 - 2 min, switch between apps 7200
running cep2, gfam, dsfl on my i7 |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Well, the combination of variable run times with this 7200 minute switch is what does it. Since the Beta/Repair tasks have a 4 day deadline and BOINC uses a safety margin on top of hours to a day [varies per version], try 4400 min. Cache plus switch time then adds up to > 5760 minutes (4 days). Since you run a cache of 1.2 days, your device, if stable, is getting repairs, which carry... a 4 day deadline.
Let us know if the EDF still hits when those 4 day deadline tasks arrive, but not with the 7 or 10 day ones, with that 1.2 day cache of course. --//-- edit: math adj. [Edit 1 times, last edit by Former Member at Dec 8, 2011 3:57:59 PM] |
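The arithmetic in the post above can be sketched in a few lines of Python. This is only the rule of thumb as described in this thread (cache plus switch interval, plus an optional safety margin, compared with the task deadline), not BOINC's actual scheduler logic; the function names and the `margin_minutes` parameter are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def window_minutes(cache_days, switch_minutes, margin_minutes=0):
    """Cache size plus switch interval plus an assumed safety margin, in minutes."""
    return cache_days * MINUTES_PER_DAY + switch_minutes + margin_minutes


def exceeds_deadline(cache_days, switch_minutes, deadline_days, margin_minutes=0):
    """True if the window is longer than the deadline (rule of thumb only)."""
    deadline_minutes = deadline_days * MINUTES_PER_DAY
    return window_minutes(cache_days, switch_minutes, margin_minutes) > deadline_minutes


if __name__ == "__main__":
    # 1.2 day cache, as in the thread; compare the two switch values discussed.
    for switch in (4400, 7200):
        for deadline in (4, 7, 10):
            flagged = exceeds_deadline(1.2, switch, deadline, margin_minutes=MINUTES_PER_DAY)
            print(switch, deadline, flagged)
```

Under these assumptions, a 1.2 day cache with a 4400 minute switch exceeds only the 4 day deadline, even with a full day of margin, while the 7200 minute switch with the same one-day margin also exceeds the 7 day deadline.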
||
|
pcwr
Ace Cruncher England Joined: Sep 17, 2005 Post Count: 10903 Status: Offline |
Mine has been doing it for the last few days (I had 6 days of cache to do in only 7 calendar days). Somehow it works it out to meet the deadline for all the WUs. All my DDDT2 WUs got in on time. Hopefully all my HCMD2 will as well.
Cache set back to 2 days until Beta comes again. Patrick |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Errata: I've actually backed off from the long switch time on a permanent basis. I'll only activate this trick when Betas are announced. I don't like [detest] racing through the repairs and finding out that the original still arrived within the grace period (which lasts as long as the validated results show on the RS pages... are "live"). By letting repairs complete at a normal pace, the original has about a day extra (my cache size), and WCG gets the chance to tell my client to abort the redundant task.
--//-- |
||
|