World Community Grid Forums
Thread Status: Active | Total posts in this thread: 17
Jesse Viviano
Cruncher | United States of America | Joined: Dec 14, 2007 | Post Count: 15 | Status: Offline
Please see work unit 1277772467. It failed on one user's computer for exceeding the CPU time limit, and I had to abort it on mine because it never made enough progress to write a checkpoint before the deadline passed. Please investigate this work unit. It is the first one I have ever had to abort on World Community Grid because of a problem with the work unit itself.
Yarensc
Advanced Cruncher | USA | Joined: Sep 24, 2011 | Post Count: 136 | Status: Offline
All CEP work units have a cap of 18 hours of CPU time. The main reason is that some of the more complicated molecules occasionally get released to the grid (the prediction algorithms aren't perfect), and the limit keeps a machine from trying to crunch one of them for days on end.
Jesse Viviano
Cruncher | United States of America | Joined: Dec 14, 2007 | Post Count: 15 | Status: Offline
That cap did not work on my machine for this work unit. I generally run my machine 10 to 12 hours a day, and the task never reached a point where a checkpoint could be written. I don't mind losing some work to infrequent checkpoints, but I do mind never reaching one at all. Because no checkpoint was ever reached, there was no record of any work done and the CPU time counter kept being reset to zero, so the 18-hour limit never triggered. I had to abort the task once the deadline passed and I realized I could never finish this result.
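For context, below is a minimal sketch of how a BOINC science application typically checkpoints (assuming the standard BOINC API; the step function and the state-file helpers are hypothetical stand-ins for the real science code). If a single step takes longer than the machine stays switched on, the state file is never written and the task restarts from zero every time:

```cpp
// Minimal sketch of the standard BOINC checkpointing pattern (illustrative only).
// The boinc_* calls are the real BOINC API; do_one_step() and the state-file
// helpers are hypothetical stand-ins for the science code.
#include <cstdio>
#include "boinc_api.h"

static const long TOTAL_STEPS = 1000;

// Hypothetical science step: in CEP2 a single "step" can be a whole quantum-chemistry job.
static void do_one_step(long /*step*/) { /* ...possibly hours of computation... */ }

static long load_state(const char* path) {          // resume point, or 0
    long step = 0;
    if (std::FILE* f = std::fopen(path, "r")) { std::fscanf(f, "%ld", &step); std::fclose(f); }
    return step;
}

static void save_state(const char* path, long step) {
    if (std::FILE* f = std::fopen(path, "w")) { std::fprintf(f, "%ld\n", step); std::fclose(f); }
}

int main() {
    boinc_init();
    long step = load_state("checkpoint.dat");

    for (; step < TOTAL_STEPS; ++step) {
        do_one_step(step);

        // State is saved only when the client says a checkpoint is due.  If a
        // single step runs for many hours, no state file is ever written, and
        // every shutdown throws away all CPU time accumulated so far.
        if (boinc_time_to_checkpoint()) {
            save_state("checkpoint.dat", step + 1);
            boinc_checkpoint_completed();
        }
        boinc_fraction_done(static_cast<double>(step + 1) / TOTAL_STEPS);
    }

    boinc_finish(0);
    return 0;
}
```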
Jesse Viviano
Cruncher | United States of America | Joined: Dec 14, 2007 | Post Count: 15 | Status: Offline
It turns out that someone with a faster CPU than mine was able to handle this work unit. Job 1 of 8 took eight and a half hours of uninterrupted processing, while the rest of the jobs were normal-sized. In effect, the work unit was bad for anyone who cannot keep a machine running for that long in one stretch.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
If, after your 10 to 12 hours, you hibernated the computer instead of powering it down, the job would have resumed without any loss the next time you powered up.
Sandvika
Advanced Cruncher | United Kingdom | Joined: Apr 27, 2007 | Post Count: 112 | Status: Offline
I have a similar issue, among others, with this WU:

E227715_310_S.306.C38H22N8.IPLZAQSKDHMLSJ-UHFFFAOYSA-N.4_s1_14_2-- | 700 | Valid | 1/16/15 21:49:19 | 1/17/15 16:42:50 | 8.87 | 375.3/375.3

So basically 18 hours of CPU time were wasted on my computer, and 16.5 on another, at the expense of other projects. I hadn't been paying attention, but the reason my estimated time to badges is wildly optimistic is now clear: work is being jettisoned. If you have hyper-threading enabled on your server CPU, you may as well not bother with this project, because it greatly extends the completion time, even though the CPU gets more done overall through more efficient use of its execution units. Another WU right now is at 99.5% after 18h05 of runtime with 0h15 estimated to remain... it got killed for CPU over-run too. It should be trivial: if hyper-threading is enabled, allow 36 hours instead of 18.

I've just pulled the plug on this project for this nine-month-old Xeon server and will limp to the next badge on my laptop, which has a basic quad-core processor. However, being a laptop it's not on 24/7, so there may well be checkpointing problems in store there too.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
'It should be trivial' - and soon those with slower devices, half the speed of your Xeon, will want more runtime too, and on and on, coming full circle on a discussion that has been beaten to death.

Sorry, but this research is just not for each and every computer. Here it runs only on a desktop, and only one task at a time, controlled with app_config on the client. By the way, using affinity-control software such as Process Lasso and setting BOINC to use the maximum number of cores minus one, you may be able to let the CEP2 job run on a physical core without a hyper-thread sibling leeching from it. For the 'yes I can, so I will' technically capable.
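For anyone who wants to try the one-task-at-a-time setup mentioned above, a minimal app_config.xml in the World Community Grid project directory is the usual mechanism. This is only a sketch: the <max_concurrent> element is standard BOINC, but the app name shown is an assumption and should be checked against the names listed in client_state.xml.

```xml
<!-- Illustrative app_config.xml: limit Clean Energy Project 2 to one task at a time.
     The app name "cep2" is an assumption; confirm the exact short name in
     client_state.xml before relying on it. -->
<app_config>
    <app>
        <name>cep2</name>
        <max_concurrent>1</max_concurrent>
    </app>
</app_config>
```

After saving the file, select 'Read config files' in BOINC Manager (or restart the client) for it to take effect.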
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
P.S. We cannot see result details except on the user's own account; posting links only shows the header information. Copy and paste the logs into the post as supporting detail.
Sandvika
Advanced Cruncher | United Kingdom | Joined: Apr 27, 2007 | Post Count: 112 | Status: Offline
Replying to the post above: maybe it's a beaten-to-death discussion because the 18-hour guillotine isn't stated on the system requirements page, and most people affected by it eventually find out the hard way? It's primarily about setting expectations.

In my case 11 of 31 completed WUs were killed at 18 hours; 5 are valid, 1 errored but received credit, 3 errored with no credit, and 2 are still pending validation. So the total loss might only be 10 to 15%... However...

Since it's now clear that neither BOINC nor the WU is determining its execution environment, the easiest solution would be to configure BOINC to use only 50% of the available processors and let the HT-aware OS scheduler allocate the WCG threads one per core. That still uses 80-90% of the spare CPU cycles. Hyper-threading isn't "leeching"; it's an optimisation to extract the last 10-20%. So the extra work achieved by HT optimisation is roughly equal to the work lost through HT unawareness in CEP2. On balance, more can be achieved with other projects, so that's where I'll focus.

I hope the huge new OET tasks that are coming won't be killed in the same way - the longest one I processed in beta took 30 hours and the result was fine.
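For reference, the 50% suggestion can be implemented either with the 'Use at most ...% of the CPUs' computing preference in BOINC Manager or with an override file along these lines; this is a sketch rather than a recommendation, and the percentage is just an example:

```xml
<!-- Illustrative global_prefs_override.xml (placed in the BOINC data directory):
     cap BOINC at half of the logical processors so the OS scheduler can put one
     WCG thread on each physical core of a hyper-threaded CPU. -->
<global_preferences>
    <max_ncpus_pct>50.0</max_ncpus_pct>
</global_preferences>
```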
Jesse Viviano
Cruncher | United States of America | Joined: Dec 14, 2007 | Post Count: 15 | Status: Offline
Actually, Hyper-Threading works largely because the 32-bit x86 architecture is a poor match for today's hardware reality. It was designed at a time when main memory ran at the same speed as the CPU, so keeping values in registers was not much of a speedup. The x86 architecture therefore kept the same number of registers, eight, when it was extended from 16 to 32 bits. Today, memory systems have huge latencies, so keeping values in registers is critical for performance. In one study cited by my old college textbook, a Pentium Pro spent over half of its cycles retiring (that is, marking instructions as complete and thereby committing them) zero instructions, with the remaining cycles spent retiring one, two, or three instructions - probably because it was stalled waiting on the memory subsystem. During those stalls, the execution resources can be used by another thread to get more work done if Hyper-Threading is enabled. AMD addressed the register shortage by doubling most of the register files to 16 registers when it designed the AMD64 architecture, but the added registers can only be used in 64-bit mode.

The Clean Energy Project - Phase 2 is unfortunately 32-bit only, so it probably leaves a lot of performance on the table. There are a few valid reasons for that. AMD deprecated the x87 FPU in favor of the SSE2 vector FPU because the x87 interface is a poor match for compilers, forcing most of them to generate inefficient code when targeting it; but x87 provides native 80-bit floating-point math, which SSE2 does not. SSE2 offers only 32-bit and 64-bit floating-point formats in addition to some integer formats. Since AMD declared that no operating system is required to preserve the contents of the x87 FPU when running a 64-bit program, the x87 interface is generally off-limits to 64-bit programs: some operating systems might preserve the x87 state across a thread switch, while others might discard it to speed the switch up. If 80-bit floating-point math is required, the program must stay a 32-bit program to be safe, because any multithreaded OS running a 32-bit x86 program will preserve the x87 state on a thread switch. The developers may also simply not have the resources to maintain a 64-bit version as well.
Even if both programs are 64-bit, Hyper-Threading still works well when two programs that stress different parts of the CPU are paired - for example, one that hogs the integer units alongside one that hogs the floating-point units.
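To make the 80-bit point concrete, here is a small illustrative program (not from the project code) showing the precision difference between a 64-bit double and an x87-backed long double. Whether long double actually uses the 80-bit format depends on the compiler and target, as noted in the comments:

```cpp
// Illustrative only: compare 64-bit double with the compiler's long double.
// On compilers that map long double onto the x87 80-bit format (e.g. GCC
// targeting x86), LDBL_DIG is 18 and the extended result keeps the tiny
// addend below; on compilers where long double == double (e.g. MSVC),
// the two results are identical.
#include <cfloat>
#include <cstdio>

int main() {
    std::printf("decimal digits: double = %d, long double = %d\n", DBL_DIG, LDBL_DIG);

    // 1e-17 is below double's precision (~2.2e-16 near 1.0) but within the
    // 80-bit extended format's precision (~1.1e-19 near 1.0).
    double      d  = 1.0  + 1e-17;
    long double ld = 1.0L + 1e-17L;

    std::printf("double:      %.20g\n",  d);   // rounds back to exactly 1
    std::printf("long double: %.20Lg\n", ld);  // keeps the 1e-17 if x87 80-bit is used
    return 0;
}
```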