World Community Grid Forums
Thread Status: Active | Total posts in this thread: 13
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
As noted during the Beta test, the checkpoints do not follow my client's setting of no more frequent than 5 minutes; they are being punched out at a rate of under 1 minute, not even the 60-second default. Sample log of just one task running this moment:
1257 WCG 14-11-2011 17:43 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1258 WCG 14-11-2011 17:44 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1259 WCG 14-11-2011 17:44 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1260 WCG 14-11-2011 17:45 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1261 WCG 14-11-2011 17:46 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1262 WCG 14-11-2011 17:47 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1263 WCG 14-11-2011 17:48 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed

This can't be good for performance or for the [slowish] hard drive [in fact, I can see efficiency dropping now]. When all cores of a quad, octo, etc. are crunching these in parallel, it could be an LED flicker fest. --//--
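For anyone wanting to turn on this logging themselves: entries like the above come from the client's checkpoint_debug log flag. A minimal cc_config.xml sketch (the flag is standard BOINC; place the file in the BOINC data directory and restart the client, or re-read config files where the manager supports it):

```xml
<!-- cc_config.xml: log one event-log line per checkpoint written -->
<cc_config>
  <log_flags>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
</cc_config>
```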
seippel
Former World Community Grid Tech | Joined: Apr 16, 2009 | Post Count: 392 | Status: Offline
We're looking into this issue. VINA checkpoints write <1 KB, so hopefully the performance impact won't be significant. Have you also experienced this same problem on DSFL?
Seippel |
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
If we may call you Al: the impact of just 1 GFAM running concurrently with 2.5 DSFL and 0.5 CEP2 during the first part is an efficiency, per BOINCTasks, of 98.2% after 2:10 hours and 89 checkpoints, as opposed to the 99.1-99.3% that is consistently recorded for DSFL (Linux). I'd be happy to test 4 concurrent; presently I've pushed a second one ahead to see if there is any cross impact.
------------------------------------------
Rob

Edit: I've long experimented with drive mount settings and found that *relatime* (now the default in Ubuntu) versus *noatime* (no timestamp updating on read access, only on write) makes but an imperceptible difference; an fstab sketch follows below. Will update when the data becomes meaningful for the different combos.
[Edit 1 times, last edit by Former Member at Nov 14, 2011 6:23:05 PM]
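For reference, access-time behaviour is set per filesystem in /etc/fstab; a hypothetical entry for a dedicated BOINC data drive (the UUID and mount point are placeholders, not from this thread):

```
# /etc/fstab: mount the BOINC data drive without access-time updates
UUID=0123-4567-89ab-cdef  /var/lib/boinc  ext4  defaults,noatime  0  2
```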
Dataman
Ace Cruncher | Joined: Nov 16, 2004 | Post Count: 4865 | Status: Offline
My first one completed with a clock time of 03:04:23 and 138 checkpoints, at 97.86% CPU. I still have some cards running on that machine, though, and will have to wait for the stats until the cards drain from the queue.
sk..
Master Cruncher | Joined: Mar 22, 2007 | Post Count: 2324 | Status: Offline
>once per minute, per task, might impact the lifespan of solid-state drives.
----------------------------------------
I'm running 3 climate models on an i7-2600 (8 GB, BOINC on a second HDD, and only using 90% CPU), plus 2 HFCC tasks and one GFAM task. The GFAM task has run for 60 min but is only at 5.5% complete. CPU time vs. elapsed time is only 9 seconds apart. The estimated time to completion is continuously rising. Is the progress indicator way out, or is something else up? GFAM_x1df7_TBdhfrDry_0000245_0112_1
-- It was just the progress indicator; it has now jumped to 25% complete after 88 min, and the time remaining is 3 h 55 min.
-- Now seems to be progressing normally; 37% after 103 min, 3 h remaining.
-- 75% took 3 h 55 min; 1 h 18 min remaining.
-- Took 5.62 h to complete; the time lost was about 6 min, ~1%.
[Edit 4 times, last edit by skgiven at Nov 15, 2011 11:33:23 AM]
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Checkpoints in less than one minute? Interesting. I've been using my machine heavily today, so my first GFAM task has run to 94.167% complete in 7:28:58 wall time but only 6:22:41 CPU. However, it took its last checkpoint at 5:33:34 CPU -- not exactly recently.
I think these WUs are not all well-behaved. |
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The jobs inside a task are variable, and having now seen half a dozen, I observe that efficiency rises rapidly with the time it takes to get to the next checkpoint... and the slower the HD subsystem, the more impact it has. One GFAM now averages 1.5 minutes per checkpoint, and efficiency has perked up to 98.6%. That's with 3 DSFL on the side... running at that 99.1-99.3%.
I'm ignorant of what these science apps do internally, but the BOINC wrapper has a standard call function that can be activated: it asks the core client at what frequency it is allowed to write a checkpoint. It asks this only once, at the start or restart of a task, and then maintains a counter; if 5 minutes is specified, it skips writing a checkpoint if the last one was less than 5 minutes ago. The counter resets at each write, so if the last one took 15 minutes, the next opportunity again simply checks whether the 5 minutes have passed (see the sketch below).

If CEP2 were to listen to this function, which it does not either, I'd set my WtD to e.g. 200 minutes on my 24/7 Linux box [on UPS]. I'm sure it would gain many minutes per task **. My boots and updates are planned, so I can switch to short-checkpointing sciences if need be. Last boot, per byobu, was 11 days ago. --//--

** Of course, the scientists may want the checkpoints in the result so they can plot the progression curve, but I've never read that to be the case. A no-checkpointing-at-all setting of 720 minutes would be foolish for this science, unless the state *as-is* were captured when a task is cut off at 12:00 hours; I somehow doubt that happens. It would turn into a gigabyte upload :O|

Anyway, just a sidebar; my main concern is the staccato if 4 or more GFAM were allowed to run concurrently. The noted SSDs might not take that kindly either.
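For the curious, a minimal sketch of that pattern as a science app would use it through BOINC's API library (boinc_api.h and the two calls are the real BOINC API; do_checkpoint() is a hypothetical app-specific routine, and whether GFAM's code actually looks like this is exactly what's in question here):

```cpp
#include "boinc_api.h"  // boinc_time_to_checkpoint(), boinc_checkpoint_completed()

void do_checkpoint();   // hypothetical: writes the app's own state file

void crunch() {
    for (;;) {
        // ... compute one slice of science work ...

        // True only once the user's "write to disk at most every N seconds"
        // interval has elapsed since the last completed checkpoint.
        if (boinc_time_to_checkpoint()) {
            do_checkpoint();               // write state to the slot directory
            boinc_checkpoint_completed();  // inform the client; restart the timer
        }
    }
}
```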
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I turned on checkpoint_debug on my machine and was surprised to find, in a span of 47 minutes, a checkpoint entry on 50 lines -- that's 50 checkpoints in 47 minutes, or just over 1 write-to-disk every minute! The WUs are DSFL_v6.19 target 65 WUs. I have my "Tasks checkpoint to disk at most every" setting at 900 seconds (15 minutes); see the sketch below. Same thing with GFAM: a checkpoint every minute or less. I thought this matter was resolved. What's going on here?
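That 900-second preference corresponds to the disk_interval element of the client's global preferences; a minimal local-override sketch (global_prefs_override.xml in the BOINC data directory; standard BOINC, though as reported above the apps appear to be ignoring it):

```xml
<!-- global_prefs_override.xml: "Tasks checkpoint to disk at most every" -->
<global_preferences>
  <disk_interval>900</disk_interval>
</global_preferences>
```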
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Double posting this comment:
It never seemed to worry me for DSFL at 99.2% efficiency, but if the 99.2% can be increased to 99.6% by the -mere- [though it could be arduous] change of complying with the WtD setting (C4CW+HCC1 do 99.8% on this rig), that'd be icing on the cake. Any company would love to see its profitability go up by 40 basis points (0.4 percentage points). At any rate, if GFAM does 98.2% and DSFL does 99.2%, that's potentially 1 percent to gain, if not 1.5% with the cases on DSFL observed by andzgrid. That's hundreds of CPU years over the project duration. --//--
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The techs have been moving the completed / validated tasks off so fast that I've hardly had time to catch the jobs per result, but here are 2 samples demonstrating how much impact frequent vs. less frequent Write to Disk control has:
----------------------------------------
40 jobs, checkpoint interval 9.25 minutes, efficiency 99.2% (recognize that number)
GFAM_x1dg7_TBdhfrDry_0000276_0244_0 -- 1767290 | Valid | 11/15/11 14:00:01 | 11/17/11 02:47:13 | 6.10 | 176.6 / 146.1 | 06:08:57 (06:05:55)

88 jobs, checkpoint interval 3.64 minutes, efficiency 98.16%
GFAM_x1df7_TBdhfrDry_0000248_0294_1 -- 1767290 | Pending Validation | 11/15/11 06:12:03 | 11/16/11 12:31:56 | 5.34 | 153.2 / 0.0 | 05:25:58 (05:20:07)

In the second case, at a 5-minute WtD, the average write interval would have been 7.28 minutes, since the checkpoints coming at 3.64 minutes would have been skipped (see the sketch below). The top case ran with 2 CEP2 jobs on the side, the second case with one CEP2, the other slots occupied by DSFL on this quad. The consistent picture from the first dozen is that there is a full percent of efficiency to gain, and probably more the slower the host and its drive subsystem are. This is a quad Q6600 with a Barracuda drive. --//--

edit: The samples were taken in optimal, hands-off conditions: headless, no GUI loaded, per top only BOINC running.
[Edit 1 times, last edit by Former Member at Nov 17, 2011 8:28:50 AM]
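A back-of-envelope sketch of that 7.28-minute figure, assuming the app reaches a checkpoint opportunity at a fixed spacing and writes only at the first opportunity after the WtD interval has elapsed (the helper name is made up for illustration):

```cpp
#include <cmath>
#include <cstdio>

// Effective write interval when opportunities come every `spacing` minutes
// but writes are skipped until `min_interval` minutes have elapsed.
double effective_interval(double spacing, double min_interval) {
    return std::ceil(min_interval / spacing) * spacing;
}

int main() {
    // 3.64-minute opportunities under a 5-minute WtD: every other opportunity
    // is skipped, so writes land 7.28 minutes apart.
    std::printf("%.2f minutes\n", effective_interval(3.64, 5.0));  // 7.28 minutes
    return 0;
}
```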