World Community Grid Forums
Thread Status: Locked. Total posts in this thread: 210
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Hello PepoS,

I had better explain that a page fault starts in the cache on the CPU chip: a new page has to be loaded from main memory into the chip cache, and at that point the size of main memory is irrelevant. A second problem occurs if the requested page is not in memory and has to be loaded from disk. That is much slower, but it is not (I think) the problem HCC has; the size of main memory would only matter if this second problem occurred a lot. HCC can generate so many page faults that the CPU is slowed down by the memory bus, which cannot keep the cache filled.

Lawrence
Movieman
Veteran Cruncher Joined: Sep 9, 2006 Post Count: 1042 Status: Offline
...a page fault starts in the cache on the CPU chip. [...] HCC can generate so many page faults that the CPU is slowed down by the memory bus, which cannot keep the cache filled.

Could this be why the 8-core Clovertown rigs are taking such a huge hit? FSB speeds are 1400 maximum with the DDR2-667 FB-DIMMs, while a lot of the quads at XS run FSB speeds up to 1800, and just about all are at 1600 using DDR2-800.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Lawrence, I'm aware of both these sources of page faults. But I thought that a CPU cache miss should not produce interrupts processed with kernel calls if the required blocks are already available in the process's mapped RAM. Still, the amount of time I observed being spent in kernel calls was 25%; in my opinion, these are not just core cache misses.

Anyway, I got the silly idea to take a look at a few call-stack snapshots of HCC task X0000054330932200508190032_1_0, to see whether there would be any interesting, repeating pattern. Take a look yourself (the snapshots were taken at approx. 20-second intervals; nothing exact, just take, copy and paste):

```
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
wcg_hcc1_img_5.15_windows_intelx86+0x12bb94
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
wcg_hcc1_img_5.15_windows_intelx86+0x366fa
wcg_hcc1_img_5.15_windows_intelx86+0x52f97
wcg_hcc1_img_5.15_windows_intelx86+0x52784
wcg_hcc1_img_5.15_windows_intelx86+0x104cfe
---
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
ntkrnlpa.exe!KiThreadStartup+0x16
NDIS.sys!ndisWorkerThread
wcg_hcc1_img_5.15_windows_intelx86+0x1b1ce
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
wcg_hcc1_img_5.15_windows_intelx86+0x366fa
wcg_hcc1_img_5.15_windows_intelx86+0x52f97
wcg_hcc1_img_5.15_windows_intelx86+0x52784
wcg_hcc1_img_5.15_windows_intelx86+0x104cfe
---
etc.
```

As you can see, the last five call-stack addresses repeat, so I'll omit the lowest (or highest) four of them in the following snapshots:

```
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
ntkrnlpa.exe!KiThreadStartup+0x16
NDIS.sys!ndisWorkerThread
wcg_hcc1_img_5.15_windows_intelx86+0x1b1ce
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
...
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
ntkrnlpa.exe!MmAccessFault+0x11ae
wcg_hcc1_img_5.15_windows_intelx86+0x12bea7
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
...
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
ntkrnlpa.exe!MmAccessFault+0x11ae
wcg_hcc1_img_5.15_windows_intelx86+0x12bea7
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
...
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
ntkrnlpa.exe!MmAccessFault+0x11ae
wcg_hcc1_img_5.15_windows_intelx86+0x12bea7
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
...
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
ntkrnlpa.exe!MmAccessFault+0x11ae
wcg_hcc1_img_5.15_windows_intelx86+0x12bea7
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
...
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
ntkrnlpa.exe!MmAccessFault+0x11ae
wcg_hcc1_img_5.15_windows_intelx86+0x12bea7
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
...
ntkrnlpa.exe!KiDispatchInterrupt+0xa7
ntkrnlpa.exe!MmAccessFault+0x11ae
wcg_hcc1_img_5.15_windows_intelx86+0x12bea7
wcg_hcc1_img_5.15_windows_intelx86+0x19c5c
...
```

I suppose the devs (and Didactylos too, as he was profiling the app) know this already, but it was interesting for me to see all these (predicted) kernel calls processing the memory access issues. Let's see (after the Linux version is finished, and if the devs get the time and the OK to take a look) whether we will someday find out what to do about it.
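PepoS spotted the MmAccessFault repetition by eye; the same check can be automated by tallying which routine sits directly under KiDispatchInterrupt across the sampled stacks. A minimal sketch: the snapshot list below is hand-copied from the samples above, reduced to their top two frames, not live profiler output.

```python
# Aggregate sampled call stacks by the frame below the interrupt dispatcher
# to find the dominant kernel path. Sample data copied from the thread.
from collections import Counter

APP = "wcg_hcc1_img_5.15_windows_intelx86"
snapshots = [
    ["ntkrnlpa.exe!KiDispatchInterrupt+0xa7", APP + "+0x12bb94"],
    ["ntkrnlpa.exe!KiDispatchInterrupt+0xa7", "ntkrnlpa.exe!KiThreadStartup+0x16"],
    ["ntkrnlpa.exe!KiDispatchInterrupt+0xa7", "ntkrnlpa.exe!KiThreadStartup+0x16"],
] + 6 * [
    ["ntkrnlpa.exe!KiDispatchInterrupt+0xa7", "ntkrnlpa.exe!MmAccessFault+0x11ae"],
]

# Tally the routine directly below the interrupt dispatcher in each sample.
hot = Counter(stack[1] for stack in snapshots)
print(hot.most_common())
```

MmAccessFault is the Windows memory-manager fault handler, so its dominance in the samples fits the roughly 25% kernel time observed above.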
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
...a page fault starts in the cache on the CPU chip. A new page has to be loaded from main memory into the chip cache. [...] HCC can generate so many page faults that the CPU is slowed down by the memory bus, which cannot keep the cache filled.

Could this be why the 8 core clovers are taking such a huge hit? FSB speeds are 1400 maximum with the DDR2-667 FB-DIMMs...

If the memory accesses (caused by the algorithm used) are "nicely" spread across the allocated memory, then yes, it can easily overload any current CPU with an FSB. But still, in my opinion, these core cache misses should not cause kernel interrupts if the data is available in RAM, unless some memory protection, or something I don't know about, causes additional kernel calls with every few memory-block misses.

[Edit 1 times, last edit by Former Member at Dec 18, 2007 4:01:12 PM]
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
...these core cache misses should not cause kernel interrupts, if the data is available in RAM. [...]

This is exactly what I'm thinking. When Windows reports a page fault, as far as I'm aware it's a main memory page fault and has nothing to do with the CPU cache.
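The distinction Questar draws (page faults serviced entirely from RAM, with no CPU-cache involvement and no disk I/O) can be observed directly. A minimal sketch for a Unix system using Python's `resource` module; the 64 MB size is an arbitrary choice, and on Windows the analogous per-process counter would be its page fault count rather than `ru_minflt`.

```python
# Soft ("minor") page faults: the kernel's fault handler runs even though no
# disk I/O occurs, because freshly allocated pages are only mapped into the
# process on first touch. Unix-only; illustrative, not the HCC code.
import resource

def minor_faults():
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

before = minor_faults()
buf = bytearray(64 * 1024 * 1024)    # 64 MB, zero-filled
for i in range(0, len(buf), 4096):   # write to every 4 KB page
    buf[i] = 1
delta = minor_faults() - before

print(f"minor page faults from touching 64 MB: {delta}")
```

Each of those faults is a kernel entry with no disk access at all, which is consistent with heavy MmAccessFault time appearing alongside only occasional drive activity.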
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Hello PepoS and Questar,

I think you are right about the vocabulary; I need to read up on this or I will just confuse people with my idiosyncratic usage. Right now I am running HCC on a single-core Windows XP system, using 521 MB for the application and another 122 MB for the OS and overhead. I am accessing the drive only once every 3 to 5 seconds, but I am throwing up a large number of cache misses, which is slowing down my computing. I am spending a noticeable amount of time in the kernel, but I am not sure why. If I had multiple cores, I would expect a large amount of memory contention, but I am guessing at this point. Without the source, I am too lazy to try to really map out the performance.

Lawrence
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I think Highwire has the best explanation so far. It agrees with what I've read, with my test results, and with what knreed has said. Also, he has some experimental results to back it up.

The only remaining question is whether it is practical to refactor the application to avoid this memory allocation pattern. Obviously it can be done, but depending on the code it may be trivial or it may be very, very non-trivial. I expect WCG will do what they have done in the past: improve the worst areas to the best of their ability in the time available.
123bob
Cruncher Joined: May 1, 2007 Post Count: 42 Status: Offline
Some may already know this, but.....
I've been looking for a workaround, and I may have found one. The data below (I hope it formats right on the forum...) shows the same machine before and after the move; I'm not sure if it's the Vista or the BOINC that stabilized this thing. Page fault counts went from billions to thousands with this move!! Look at how consistent the WUs run. I've made this move on four machines, and all four show the same stabilization.

Regards, Bob

Win Server 2003 32-bit, BOINC 5.10.13 (all results on device BOBS-FARM-04, status Valid):

```
Result Name                    Sent Time         Return Time       CPU Time (h)  Claimed / Granted BOINC Credit
X0000053701414200507181229_1   12/15/2007 11:47  12/17/2007  1:14  3.21           69.0 /  67.2
X0000053700504200507181245_0   12/15/2007 11:16  12/17/2007  1:14  3.1            66.5 /  73.7
X0000053700476200507181246_1   12/15/2007 11:15  12/17/2007  4:03  5.74          122.9 /  74.5
X0000053700736200507180920_1   12/15/2007 10:27  12/17/2007  1:14  5.3           113.9 /  69.0
ll117_00044_2                  12/15/2007 10:26  12/16/2007 23:52  4.21           90.2 /  83.7
X0000053691317200508152322_1   12/15/2007  9:42  12/16/2007 23:52  5.17          110.8 /  57.8
ll116_00160_4                  12/15/2007  9:19  12/16/2007 23:52  4.11           88.0 /  83.3
X0000053691167200507181207_1   12/15/2007  8:40  12/16/2007 23:52  5.03          107.9 /  73.3
X0000053691233200507180843_1   12/15/2007  7:35  12/16/2007 16:08  4.55           97.3 / 103.8
X0000053690140200507180901_1   12/15/2007  6:49  12/16/2007 23:52  6.11          131.0 / 131.0
X0000053341031200507130919_1   12/14/2007 18:07  12/16/2007  1:58  5.09          109.2 /  68.3
X0000053201106200507120844_0   12/14/2007 12:59  12/16/2007  1:58  3.46           74.2 /  77.4
X0000053200888200507120849_0   12/14/2007 12:49  12/16/2007  1:58  3.83           82.2 /  75.1
X0000053140693200507111427_0   12/14/2007 11:30  12/15/2007 22:25  5.13          110.4 /  78.3
X0000052980196200507080905_1   12/14/2007  9:31  12/15/2007 22:25  5.36          115.5 /  85.3
```

Vista Ultimate 64-bit, BOINC 5.10.28 (all results on device bobs-farm-04, status Valid):

```
Result Name                    Sent Time         Return Time       CPU Time (h)  Claimed / Granted BOINC Credit
X0000055740864200508171137_1   12/19/2007  5:38  12/20/2007 16:29  2.69           68.1 / 76.8
X0000055740640200508171139_1   12/19/2007  5:29  12/20/2007 16:29  2.65           67.1 / 59.4
X0000055740408200508171144_1   12/19/2007  5:14  12/20/2007 16:29  2.7            68.4 / 66.3
X0000055520867200509022121_1   12/19/2007  3:01  12/20/2007 16:29  2.71           68.8 / 72.0
X0000055520791200509022122_0   12/19/2007  3:00  12/20/2007 16:29  2.65           67.2 / 60.5
X0000055521380200508191534_1   12/19/2007  1:13  12/20/2007 16:29  2.7            68.4 / 76.6
X0000055520127200508120832_0   12/18/2007 23:07  12/20/2007  8:20  2.74           69.4 / 78.8
X0000055520004200508120834_0   12/18/2007 23:05  12/20/2007  7:19  2.65           67.1 / 71.2
X0000055511496200509022044_0   12/18/2007 23:04  12/20/2007  7:19  2.66           67.3 / 67.4
X0000055511365200509022047_1   12/18/2007 23:02  12/20/2007  6:02  2.77           68.7 / 68.7
X0000055511363200509022047_0   12/18/2007 23:02  12/20/2007  6:02  2.67           66.1 / 68.3
X0000055510854200508262138_0   12/18/2007 18:52  12/20/2007  2:39  2.66           65.9 / 66.0
X0000055510719200508262140_1   12/18/2007 18:42  12/20/2007  1:08  2.7            66.8 / 74.1
X0000055510644200508262141_1   12/18/2007 18:40  12/20/2007  0:51  2.61           64.6 / 64.5
```
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Thanks 123bob,

This shows that the problem is a solvable one. I don't have it (much) on my one-core machine, so I have been just guessing about why it hits other people so hard. It looks as though it will be some sort of problem like the one Highwire suggested. The techs are aware of this problem (and several others), but I won't start nagging them until after New Year's Day.

Lawrence
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1677 Status: Offline
Hello Everybody,

Christmas time has arrived for all of us, including the techs who support the grid computing projects around the clock and around the year; to them especially, I wish a great Christmas time!

Because I was very busy for business reasons (incl. traveling) during the last weeks, I was only able to follow the thread's development, and the results of the various investigations, from time to time. Personally, I had the feeling that the community made a lot of progress, even if the root cause is still not properly identified. I hope that everybody, especially people devoting several hosts to the grid computing projects, will monitor their host performance more accurately than in the past. Because of this "dramatic" performance problem, I reallocated the various WCG projects with a little bit more understanding. Indeed, today I reach a similar performance to two months ago, although I had to retire one of my best hosts (T7200, 2 GHz, 2 GB RAM)!

In order to complete the different projects within a reasonable time and with reasonable energy consumption (environmental protection should also be an issue), we have to become better at managing the computation resources. Such performance review or monitoring must be performed accurately when new projects are introduced. I wonder whether some crunchers should volunteer to report any "side effects" during the first weeks or months of a project. In addition to the WCG and BOINCstats reports, I think that at least some members should cross-compare platform/host performance against projects, in order to identify critical configurations and to optimize computation efficiency by providing recommendations.

This is my current thinking today. I would enjoy it if some of you could submit ideas for becoming better and working in a more efficient way. I would like to share ideas about this and to collaborate on elaborating utilities or tools that help to cross-compare platforms and projects and to monitor performance in a time-efficient manner. Maybe such discussions should occur in a separate thread, because this impacts every project and not HCC only!

Again, merry Christmas to everybody!

Cheers,