This topic has been viewed 3273 times and has 12 replies.
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

As noted during the Beta test, the checkpoints do not follow my client's setting of no more frequent than 5 minutes; they are being punched out at a rate of under 1 minute, not even the 60-second default. Sample log of just 1 task running this moment:

1257 WCG 14-11-2011 17:43 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1258 WCG 14-11-2011 17:44 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1259 WCG 14-11-2011 17:44 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1260 WCG 14-11-2011 17:45 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1261 WCG 14-11-2011 17:46 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1262 WCG 14-11-2011 17:47 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed
1263 WCG 14-11-2011 17:48 [checkpoint] result GFAM_x1df7_TBdhfrDry_0000226_0188_1 checkpointed

This can't be good for performance or for the [slowish] hard drive [in fact, see the efficiency dropping now]. When all cores of a quad, octo, etc. are crunching these in parallel, it could turn into an LED flicker fest.
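
If it helps the triage: the client stores that preference as a disk_interval value in seconds, so a 5-minute setting should look roughly like this in global_prefs.xml (or global_prefs_override.xml) -- a minimal sketch, all other preferences omitted:

    <global_preferences>
        <disk_interval>300</disk_interval>
    </global_preferences>

A compliant app would then checkpoint no more often than once every 300 seconds.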

--//--
[Nov 14, 2011 4:59:08 PM]
seippel
Former World Community Grid Tech
Joined: Apr 16, 2009
Post Count: 392
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

We're looking into this issue. VINA checkpoints write less than 1 KB, so hopefully the performance impact wouldn't be significant. Have you also experienced the same problem on DSFL?

Seippel
[Nov 14, 2011 5:59:47 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

If we may call you Al: the impact of just 1 concurrent GFAM, with 2.5 DSFL and 0.5 CEP2 running during the first part, is an efficiency per BOINCTasks of 98.2% after 2:10 hours and 89 checkpoints, as opposed to the 99.1-99.3% which is consistently recorded for DSFL (Linux). I'd be happy to test 4 concurrent; for the moment I've pushed a second one ahead to see if there is cross impact.
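
Back-of-envelope, assuming the whole shortfall is checkpoint I/O wait: 2:10 hours is 130 minutes, so the roughly 1 point drop versus DSFL's 99.2% is about 0.01 x 130 min = 1.3 minutes lost, i.e. around 0.9 seconds per checkpoint across those 89 checkpoints.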

-- Rob

Edit: I've long experimented with drive mount settings and found that *relatime* (now the default in Ubuntu) and *noatime* (no timestamp updating on read access, but still on write) make but an imperceptible difference. Will update when the data becomes meaningful for the different combos.
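
For anyone wanting to try the same, the option goes in the fourth field of /etc/fstab (a sketch; the UUID, mount point and filesystem are placeholders for your own):

    UUID=your-disk-uuid  /  ext4  noatime,errors=remount-ro  0  1

followed by a remount (sudo mount -o remount /) to apply it without rebooting.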
----------------------------------------
[Edit 1 times, last edit by Former Member at Nov 14, 2011 6:23:05 PM]
[Nov 14, 2011 6:17:58 PM]
Dataman
Ace Cruncher
Joined: Nov 16, 2004
Post Count: 4865
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

My first one completed with a clock time of 03:04:23 and 138 checkpoints, at 97.86% CPU, but I still have some cards running on that machine and will have to wait to get stats until the cards in the queue drain.
[Nov 14, 2011 7:05:56 PM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

> Once per minute, per task, might impact the lifespan of solid state drives.

I'm running 3 climate models on an i7-2600 (8 GB, with BOINC on a second HDD, and only using 90% CPU), plus 2 HFCC tasks and one GFAM task.
The GFAM task has run for 60 min but is only at 5.5% complete. CPU time and elapsed time are only 9 seconds apart. The estimated time to complete is continuously rising. Is the progress estimate way out, or is something else up? GFAM_x1df7_TBdhfrDry_0000245_0112_1

-- It was just the progress; it has now jumped to 25% complete after 88 min, and time remaining is 3 h 55 min.
-- Now it seems to be progressing normally; 37% after 103 min, 3 h remaining.
-- 75% took 3 h 55 min; 1 h 18 min remaining.
-- Took 5.62 h to complete; time lost was about 6 min, ~1%.
----------------------------------------
[Edit 4 times, last edit by skgiven at Nov 15, 2011 11:33:23 AM]
[Nov 14, 2011 8:29:49 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

Checkpoints in less than one minute? Interesting. I've been using my machine heavily today, so my first GFAM task has run to 94.167% complete in 7:28:58 wall time but only 6:22:41 CPU. However, it took its last checkpoint at 5:33:34 CPU. Not exactly recently.

I think these WUs are not all well-behaved.
[Nov 15, 2011 12:21:54 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

Jobs inside a task are variable, and now having seen half a dozen, I observe that efficiency rapidly increases with the time it takes to get to the next checkpoint... and the slower the HD subsystem, the more impact it has. Right now one GFAM averages 1.5 minutes per checkpoint and efficiency has perked up to 98.6%. That's with 3 DSFL on the side, running at that 99.1-99.3%.

Ignorant as I am of what these science apps do internally, the BOINC wrapper has a standard call that can be activated: it asks the core client whether it is allowed to write a checkpoint, and at what frequency. It only asks this once, at the start or restart of a task. It then maintains a counter, and if 5 minutes is specified, it will skip writing a checkpoint if the last one was written less than 5 minutes ago. The counter resets at each write, so if the last checkpoint took 15 minutes to arrive, the next will again be written once the 5 minutes have passed.
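
For the curious, this is roughly what honoring that setting looks like from the app side using the standard BOINC API calls (boinc_time_to_checkpoint / boinc_checkpoint_completed). A minimal sketch only; TOTAL_STEPS and the work/checkpoint helpers are hypothetical stand-ins for the real science code:

    #include "boinc_api.h"
    #include <cstdio>

    const int TOTAL_STEPS = 1000;               // hypothetical workload size

    static void do_one_unit_of_work(int) {}     // stand-in for the science kernel

    static int restore_from_checkpoint() {      // resume where the last write left off
        int step = 0;
        FILE* f = fopen("checkpoint.txt", "r");
        if (f) { fscanf(f, "%d", &step); fclose(f); }
        return step;
    }

    static void write_checkpoint_file(int step) {
        // a real app would write to a temp file and rename, for atomicity
        FILE* f = fopen("checkpoint.txt", "w");
        if (f) { fprintf(f, "%d\n", step); fclose(f); }
    }

    int main() {
        boinc_init();
        for (int step = restore_from_checkpoint(); step < TOTAL_STEPS; step++) {
            do_one_unit_of_work(step);
            if (boinc_time_to_checkpoint()) {   // true only once the client's
                                                // disk_interval has elapsed
                write_checkpoint_file(step);
                boinc_checkpoint_completed();   // resets the client-side timer
            }
            boinc_fraction_done((double)step / TOTAL_STEPS);
        }
        boinc_finish(0);
        return 0;
    }

An app that writes unconditionally, skipping the boinc_time_to_checkpoint() guard, would produce exactly the once-a-minute staccato logged above.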

If CEP2 were to listen to this function, which it does not either, I'd set my WtD to e.g. 200 minutes on my 24/7 Linux box [on UPS]. I'm sure it would gain many minutes per task **. My reboots and updates are planned, so I can switch to short-checkpointing sciences if need be. Last boot per byobu was 11 days ago.

--//--

** Of course, the scientists may want to see the checkpoints in the result to plot the progression curve, but I've never read that to be the case. Setting 720 minutes for no checkpointing at all would be foolish for this science, unless the state *as-is* is captured when the task is cut off at 12:00 hours. I somehow doubt that happens; it would turn into a gigabyte upload :O| Anyway, just a sidebar; my main concern is the staccato if 4 or more GFAM were allowed to run concurrently. The noted SSDs might not take that kindly either.
[Nov 15, 2011 9:57:46 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

I turned on checkpoint_debug on my machine and was surprised that, in a span of 47 minutes, there was a checkpoint entry on 50 lines -- that's 50 checkpoints in 47 minutes, or just over 1 write to disk every minute! The WUs are DSFL_v6.19 target 65 WUs. I have my "Tasks checkpoint to disk at most every" setting at 900 seconds (15 minutes). Same thing with GFAM: a checkpoint every less-than-a-minute. I thought this matter was resolved. What's going on here?
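
For reference, that flag lives in cc_config.xml in the BOINC data directory; a minimal sketch, all other options omitted:

    <cc_config>
      <log_flags>
        <checkpoint_debug>1</checkpoint_debug>
      </log_flags>
    </cc_config>

The client picks it up on restart, or when you re-read the config file from the manager's Advanced menu.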
[Nov 16, 2011 6:32:42 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

Double posting this comment:

It never seemed to worry me for DSFL at 99.2% efficiency, but if the 99.2% can be increased to 99.6% by the -mere- [though it could be arduous] change of complying with the WtD setting (C4CW+HCC1 do 99.8% on this rig), that'd be icing on the cake. Any company would love to see their profitability go up by 40 basis points. At any rate, if GFAM does 98.2 and DSFL does 99.2, that's potentially 1 percent to gain, if not 1.5% with the cases on DSFL observed by andzgrid. That's hundreds of CPU years over the project duration.
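
To put numbers on that: 99.6% - 99.2% = 0.4 percentage points, i.e. the 40 basis points; and if the project ultimately consumes on the order of tens of thousands of CPU years in total (an assumption purely for illustration), 1% of that is indeed hundreds of CPU years.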

--//--
[Nov 16, 2011 9:13:03 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GFAM: Checkpoint writing not adhering to client *Write to Disk* setting

The techs have been moving off the completed/validated tasks so fast that I've hardly had time to catch the jobs per result, but here are 2 samples demonstrating how much impact frequent/less frequent Write to Disk control has:

40 jobs, checkpoint interval 9.25 minutes, efficiency 99.2% (recognize that number)

GFAM_x1dg7_TBdhfrDry_0000276_0244_0 -- 1767290 Valid 11/15/11 14:00:01 11/17/11 02:47:13 6.10 176.6 / 146.1 06:08:57 (06:05:55)

88 jobs, checkpoint interval 3.64 minutes, efficiency 98.16%

GFAM_x1df7_TBdhfrDry_0000248_0294_1 -- 1767290 Pending Validation 11/15/11 06:12:03 11/16/11 12:31:56 5.34 153.2 / 0.0 05:25:58 (05:20:07)

In the above case, at a 5-minute WtD, the average write interval would have been 7.28 minutes, since the writes coming due at 3.64 minutes would have been skipped.
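
Spelling that out (a sketch, assuming checkpoint attempts at a fixed 3.64-minute cadence and a 5-minute minimum): the attempt at 3.64 min is skipped, the one at 7.28 min is written, 10.92 is skipped, 14.56 is written, and so on -- every other attempt lands, so writes average 2 x 3.64 = 7.28 minutes apart.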

The top case ran with 2 CEP2 jobs on the side, the 2nd case with one CEP2 and the other slots occupied by DSFL on this quad. The consistent picture I'm getting from the first dozen is that there's a full percent of efficiency to gain, and probably more the slower the host and its drive subsystem are. This is a quad Q6600 with a Barracuda drive.

--//--

edit: The samples are from optimal, hands-off conditions: headless, no GUI loaded, per top only BOINC running.
----------------------------------------
[Edit 1 times, last edit by Former Member at Nov 17, 2011 8:28:50 AM]
[Nov 17, 2011 8:19:03 AM]