Jesse Viviano
Cruncher
United States of America
Joined: Dec 14, 2007
Post Count: 15
Bad work unit

Please see work unit 1277772467. This work unit failed on one user's computer due to exceeding a CPU time limit, and I had to abort it because it never made any significant progress on my machine: it went past the deadline before it could checkpoint anything. Please investigate this work unit. This is the first work unit I have had to abort on World Community Grid due to problems with the work unit itself.
[Jan 12, 2015 11:13:55 PM]
Yarensc
Advanced Cruncher
USA
Joined: Sep 24, 2011
Post Count: 136
Re: Bad work unit

All CEP work units have a cap of 18 hours. The main reason for this is that some of the more complicated molecules accidentally get released to the grid (prediction algorithms aren't perfect), and the limit is in place so that when that happens, a machine isn't stuck trying to crunch one for days and days.
[Jan 14, 2015 2:03:52 AM]
Jesse Viviano
Cruncher
United States of America
Joined: Dec 14, 2007
Post Count: 15
Re: Bad work unit

That cap did not work on my machine for this work unit. I generally run my machine for 10 to 12 hours a day, and it never got to a point where a checkpoint could be made. I don't mind losing work to the infrequent checkpoints, but I do mind never reaching a checkpoint at all. Since the first checkpoint was never reached, there was no record of any work that took place, and the CPU time counter kept getting zeroed out. The 18-hour limit therefore never tripped, and I had to abort once the deadline had passed and I realized I could never finish this result.
[Jan 14, 2015 8:00:34 PM]
Jesse Viviano
Cruncher
United States of America
Joined: Dec 14, 2007
Post Count: 15
Re: Bad work unit

It turns out that someone with a faster CPU than mine was able to handle this work unit. Job 1 of 8 in this work unit took eight and a half hours of continuous time to process, but the rest of the jobs were normal-sized. The work unit was bad for those who cannot keep their machines running for extremely long stretches at a time.
[Jan 15, 2015 6:08:35 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Bad work unit

If after 10-12 hours you hibernated your computer instead of powering it down, the job would have continued without loss the next time you powered up.
[Jan 15, 2015 6:14:57 PM]
Sandvika
Advanced Cruncher
United Kingdom
Joined: Apr 27, 2007
Post Count: 112
Re: Bad work unit

I have a similar issue, among others, with this WU:

E227715_310_S.306.C38H22N8.IPLZAQSKDHMLSJ-UHFFFAOYSA-N.4_s1_14_2--  700  Valid  1/16/15 21:49:19  1/17/15 16:42:50   8.87  375.3/375.3
E227715_310_S.306.C38H22N8.IPLZAQSKDHMLSJ-UHFFFAOYSA-N.4_s1_14_1--  700  Error  1/16/15 03:13:17  1/16/15 21:47:43  18.00  264.2/0.0
E227715_310_S.306.C38H22N8.IPLZAQSKDHMLSJ-UHFFFAOYSA-N.4_s1_14_0--  700  Error  1/15/15 06:34:37  1/16/15 02:05:18  16.50  340.2/0.0


So basically 18 hours of CPU time were wasted on my computer and 16.5 on another, at the expense of other projects. I hadn't been paying attention, but the reason my estimated time to badges is wildly optimistic is now clear: work is being jettisoned. If you have hyper-threading enabled on your server CPU, you may as well not bother with this project, because it greatly extends the completion time, even though the CPU gets a lot more done overall through more efficient use of its execution units. Another WU right now is at 99.5% after running 18h05, with 0h15 estimated to remain... it got killed for CPU over-run too.

It should be trivial: if hyper-threading is enabled, allow 36 hours, not 18.

I've just pulled the plug on this project for this 9-month-old Xeon server and will limp to the next badge on my laptop, which has a basic quad-core processor... however, being a laptop it's not on 24/7, and there may well be checkpointing problems in store.
[Jan 18, 2015 3:27:40 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Bad work unit

'It should be trivial', and soon those with slower devices, half the speed of your Xeon, will want more runtime too, and on and on, coming full circle on a beaten-to-death discussion. The 18 hours is presently already 2.2 times the active daily average and 2.6 times the project mean. If a device can't do a single job in 18 hours, or cannot run a task without interruption until the first job checkpoint, it is best not opted into this project.

Sorry, but this research is just not for each and every computer. Here it runs only on the desktop, and only one task at a time, controlled with app_config on the client.

BTW, using affinity control software like Process Lasso and setting BOINC to max cores minus one, you may be able to let the CEP2 job run on a physical core without hyper-thread leeching. For the 'Yes I can, so I will' technically capable.
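For reference, a minimal sketch of such an app_config.xml, placed in the World Community Grid project folder under the BOINC data directory. The app short name cep2 is an assumption; check client_state.xml for the exact name your client reports:

    <app_config>
        <app>
            <name>cep2</name>
            <!-- Assumed app short name; run at most one CEP2 task at a time. -->
            <max_concurrent>1</max_concurrent>
        </app>
    </app_config>

After saving, re-read config files from the BOINC Manager's Advanced menu (or restart the client) so the limit takes effect.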
[Jan 18, 2015 10:46:00 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Bad work unit

P.S. We cannot see result details except on the user's own account; posting links gives only the header info. Copy/paste the logs into a post as support instead.
[Jan 18, 2015 10:48:20 AM]
Sandvika
Advanced Cruncher
United Kingdom
Joined: Apr 27, 2007
Post Count: 112
Re: Bad work unit

'It should be trivial', and soon those with slower devices, half the speed of your Xeon, will want more runtime too, and on and on, coming full circle on a beaten-to-death discussion. The 18 hours is presently already 2.2 times the active daily average and 2.6 times the project mean. If a device can't do a single job in 18 hours, or cannot run a task without interruption until the first job checkpoint, it is best not opted into this project.

Sorry, but this research is just not for each and every computer. Here it runs only on the desktop, and only one task at a time, controlled with app_config on the client.

BTW, using affinity control software like Process Lasso and setting BOINC to max cores minus one, you may be able to let the CEP2 job run on a physical core without hyper-thread leeching. For the 'Yes I can, so I will' technically capable.


Maybe it's a beaten-to-death discussion because the 18-hour guillotine isn't stated on the system requirements page, and most people affected by it eventually find out the hard way? It's primarily about setting expectations.

In my case, 11 of 31 completed WUs got killed at 18h. Of those 11, 5 are valid, 1 errored but received credit, 3 errored with no credit, and 2 are still pending validation. So it might only be a 10 to 15% total loss... however...

Since it's now clear that neither BOINC nor the WU is determining its execution environment, the easiest solution would be to configure BOINC to use only 50% of the available processors and let the HT-aware OS scheduler allocate the WCG threads one per physical core. This will still use 80-90% of the spare CPU cycles. Hyper-threading isn't "leeching"; it's optimisation to extract that last 10-20%.
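A minimal sketch of that 50% setting, assuming you apply it via global_prefs_override.xml in the BOINC data directory rather than through the Manager's preferences dialog:

    <global_prefs_override>
        <!-- Use at most half the logical processors: one thread per physical core on an HT machine. -->
        <max_ncpus_pct>50.0</max_ncpus_pct>
    </global_prefs_override>

Then have the client pick it up with boinccmd --read_global_prefs_override, or equivalently tick 'Use at most 50% of the CPUs' in the Manager's computing preferences.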

So it seems the extra work achieved by HT optimisation is roughly equal to the work lost through HT unawareness in CEP2. On balance, more can be achieved with other projects, so that's where I'll focus. I hope the huge new OET tasks that are coming won't be killed in the same way; the longest one I processed in beta took 30 hours and the result was fine.
[Jan 18, 2015 11:33:57 PM]
Jesse Viviano
Cruncher
United States of America
Joined: Dec 14, 2007
Post Count: 15
Re: Bad work unit

Actually, Hyper-Threading works mostly because the 32-bit x86 architecture is a total mismatch for today's hardware reality. It was designed at a time when computer memory ran at the same speed as the CPU, so keeping things in registers was not much of a speedup. The x86 architecture therefore kept the same count of eight registers when it was extended from 16 to 32 bits. Today, memory systems have huge latencies, so keeping things in registers is critical for performance. In one study cited by my old college textbook, a Pentium Pro spent over half of its cycles retiring zero instructions (retiring means marking instructions as complete and therefore committing them), with the rest of its cycles spent retiring one, two, or three instructions. This was probably because it was stalled waiting on the memory subsystem. During such stalls, the execution resources can be used by another thread to get more work done if Hyper-Threading is enabled.

AMD solved the register shortage by doubling most of the register files to 16 registers when it designed the AMD64 architecture, but the added registers can only be used in 64-bit mode. The Clean Energy Project - Phase 2 unfortunately is 32-bit only, so it probably leaves a lot of performance on the table. There are a few valid reasons for staying 32-bit. AMD deprecated the x87 FPU in favor of the SSE2 vector FPU because the x87 interface is a poor match for compilers and forces most of them to generate inefficient code, but x87 does provide native 80-bit floating-point math, which SSE2 does not: SSE2 only provides 32-bit and 64-bit floating-point formats in addition to some integer formats. Since AMD declared that no operating system is required to preserve the x87 FPU state when running a 64-bit program, the x87 interface is generally off-limits to 64-bit programs: some operating systems might preserve the x87 state on a thread swap, while others might discard it to speed the swap up. If 80-bit floating-point math is required, the program must stay a 32-bit program for safety, because any multithreaded OS running a 32-bit x86 program will preserve the x87 state on a thread swap. The developers might also simply not have the resources to build a 64-bit version as well.
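As a quick illustration of the 80-bit point above: on 32-bit x86 builds with GCC, long double maps to the x87 extended format, while double is the 64-bit format SSE2 works with. A minimal C sketch; the exact output depends on compiler and target:

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        /* 64-bit IEEE double: 53 mantissa bits; this is what SSE2 computes with. */
        printf("double:      %d mantissa bits\n", DBL_MANT_DIG);
        /* On x86 with GCC this typically prints 64: the x87 80-bit extended format. */
        printf("long double: %d mantissa bits\n", LDBL_MANT_DIG);
        return 0;
    }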

Even if both programs are 64-bit, Hyper-Threading will still work well when two programs that use different parts of the CPU are paired, for example one that hogs the main integer units alongside one that hogs the floating-point units.
[Jan 19, 2015 3:10:26 AM]