World Community Grid Forums
Thread Status: Active Total posts in this thread: 76
martin64
Senior Cruncher Germany Joined: May 11, 2009 Post Count: 445 Status: Offline
Having looked through various threads, I still couldn't find an explanation of how the process with the large HCMD2 WUs works. Here is what I think I have understood so far:

1. All WUs have a distribution of 2 and a quorum of 2.
2. The WUs' runtime is hard to predict, resulting in some rather long runtimes.
3. Because the WUs calculate "positions", these long WUs can be split. That is, if the WU time reaches 6 hours, it will continue if the estimated progress is at least 60%; otherwise it is stopped. It will be stopped at 12 hours anyway.
4. The work left over is then distributed again, where the "children" sort of "inherit" the rest of the WU and continue where their "parents" stopped.

Now to the stuff I haven't understood: client-side termination (or better: truncation) of a WU in a quorum-2 environment means that the number of positions I have calculated is different from what my wingman has calculated. So how is the validation done? Of course, if my WU is the "shorter" one, the common positions can be validated against each other, so my result will be valid. But what about the "longer" one, where some positions have no second result to validate against? Do we believe that the rest is likely to be correct if some of the result has been validated? Does the "child" WU start at the first position that was not calculated, or at the first position that could not be validated?

Regards, Martin
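The cut-off rule in point 3 can be sketched in a few lines of Python. This is only an illustration of the rule as described in the post above: the function name and constants are hypothetical, and the real client may evaluate the check differently.

```python
# Thresholds as described in the post (hypothetical constant names).
SOFT_LIMIT_HOURS = 6.0    # first checkpoint
HARD_LIMIT_HOURS = 12.0   # unconditional stop
MIN_PROGRESS = 0.60       # progress needed at 6 hours to keep running


def should_stop(elapsed_hours: float, progress: float) -> bool:
    """Return True if the WU should be truncated at this point.

    progress is the estimated fraction of the WU completed (0.0 to 1.0).
    """
    if elapsed_hours >= HARD_LIMIT_HOURS:
        return True                      # stopped at 12 hours anyway
    if elapsed_hours >= SOFT_LIMIT_HOURS:
        return progress < MIN_PROGRESS   # continue only if at least 60% done
    return False                         # before 6 hours: keep crunching
```

For example, a WU at 6 hours and 59% progress would be stopped, while one at 6 hours and 60% would run on until it finishes or reaches the 12 hour limit.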
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
I could guess, but I'll leave this beauty for knreed to answer. He is the one who devised the algorithm and is currently out east of the Spanish plain at the annual BOINC workshop.

Validation is done on the same [common] positions each has done. The extra positions done by the faster device are assumed to be okay, as I think I have read. Where a child starts off, I don't recollect: from the highest position completed in a quorum, or from the last one matched? If there is a 100% match requirement it would be the lowest, implying an amount of redundancy. The increased number of homogeneity groups for this project (i.e. P3 in a P3 group, P4 in a P4 group, etc.) already couples devices of similar capability to reduce that redundancy, if so. As an aside, I think there is enough statistical data to assign a confidence level; for instance, is the device rated as reliable? As said though, I'll leave the techs to answer the intricacies, as I too am quite puzzled.
WCG
Please help to make the Forums an enjoyable experience for All! |
||
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
I've been wondering about this, too.

Associated with the method of validation is the question of which positions the wingman gets when I get a child or later-generation WU.

Here is a scenario. I forget how many total positions there are in HCMD2 WUs, but for the example I'll assume 600. Starting with a virgin WU, cruncher A completes all 600 positions. On the 2nd stream, cruncher B does 200. I'm cruncher C and I get positions 201-600. What does my wingman (D) get? My Results Status says that he always gets a WU with the same name as mine, except for the last digit, which is the number of the copy. If he really gets positions 201-600, this means that these positions will get crunched 3 times. If C and D do not both complete their 400 positions, yet another duplication of crunching will be added. Et cetera. Such a system would be very wasteful.

Or are cruncher D and his WU imaginary, with his WU made up of positions extracted from result A? The latter scheme would be the most efficient, and also the simplest to implement. A new WU would be split into 2 streams. Each stream would be split into as many real WUs as needed to crunch all positions. The initial, parent WUs would be real and identical, while most other wingman WUs would be imaginary. For any WU returned, when all corresponding positions in the opposite stream have been crunched, the imaginary wingman WU could be synthesised if necessary and validation could proceed.

Comments?

[Edit 1 times, last edit by Rickjb at Oct 26, 2009 8:31:31 AM]
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Let's first have knreed explain this; part of your question is a paraphrase of what I already said and what martin64 already asked.

Theoretically a single child result could be run to compute the missing matches of the parents, but then what would it be validated against? What if the device that does the spare positions is not in the same homogeneity group? You still need a second result in the quorum to get any validation at all. That is probably one reason why having all the positions of the 5000 parents done takes longer, as WCG will want to determine which ones to repackage into the child and grandchild.

edit: 5000 parent pairs: http://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,25861

When WCG upped the initial cut-off time from 4 hours to 6 hours, there was a huge reduction in results. The mean project run time went from 3 hours to 4.6 hours, so I do not know if the 15000 descendants is still a valid number.
[Edit 1 times, last edit by Sekerob at Oct 26, 2009 9:36:09 AM]
||
|
Mysteron347
Senior Cruncher Australia Joined: Apr 28, 2007 Post Count: 179 Status: Offline
Well, I'll join the puzzled club.

Suppose we have IDENTICAL machines which process 40% of a WU in the allotted time. The second generation would appear to be FOUR tasks, starting at the 40% mark and completing to 80%. We'd then have a third generation, with EIGHT tasks. There would be massive wasted effort in this case.

Suppose we have one machine twice as fast as the others, so that the first generation returned results at 40% and 80%. The next generation could be TWO tasks started at the 40% mark, wasting the extra processing by the faster processor and effectively limiting the processing speed to the SLOWER of the processors selected. OR the next generation could start at the 80% mark, violating the entire concept of matching, and hence unlikely.

Any other way would seem extremely complex to implement, perhaps trying to combine the 40% that was processed only once with the 20% that is totally unprocessed and splitting it between two tasks. Very ugly, and it does not seem to fit with the numbering system for later generations (whereas the start-at-40% and start-at-80% scenarios would: both next-generation tasks carry the same number, bar the replication number).

And it can't be the case that a task is simply completing a partially-completed task, as they ALWAYS appear in pairs. If only the incomplete part were being passed on, every generation beyond the first would have a unique start-position number as part of its task designation.

I regret that the "homogeneity group" argument also doesn't hold water. knreed is concerned in that thread about the difference between SSE2 and non-SSE2 processors. From my own results, I have instances where my crunching partners have apparently taken between 36% and 142% of the time I took to crunch what appear to be identical tasks. This means that within my "homogeneity group" there is nearly a 4:1 speed ratio, which must be far greater than the SSE2/no-SSE2 scenario.

I theen' someone has some 'splainin' to do...
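The "nearly 4:1" figure follows directly from the two percentages quoted in the post above. A minimal check in Python, assuming the 36% and 142% figures are run times relative to the poster's own time for the same task (the function name is just for illustration):

```python
def speed_ratio(fastest_pct: float, slowest_pct: float) -> float:
    """Ratio of the slowest partner's run time to the fastest partner's."""
    return slowest_pct / fastest_pct


# Wingmen took between 36% and 142% of the poster's own run time.
ratio = speed_ratio(36, 142)
print(f"{ratio:.2f}:1")  # prints 3.94:1
```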
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Parent is singular in meaning: 1 parent (in quorum 2), 1 child (in quorum 2), etc. It's not exponential, so originally, with the shorter run times, 5000 parent tasks generated 5000 children, generating 5000 grandchildren and then even great-grandchildren. Those I've not seen for a long time.

Homogeneity is matching CPUs of similar features. Thus where other projects are just matched Windows/Windows, with HCMD2 there's further specialization, the consequence being more equal run times. Yes, those can still show a substantial spread, but it is still less than when matching a P3 with an i7-920.
[Edit 1 times, last edit by Sekerob at Oct 26, 2009 7:18:58 PM]
||
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
"... then even great-grandchildren. Those I've not seen for a long time." The greats aren't so unusual. They get a short return date, so you have to be a fast returner to get them:

> CMD2_0132-ARAF.clustersOccur-3BT2_A.clustersOccur_0_17879_22158_21086_22158_21655_21907_0 | rjb-a64x2 | Pending Validation | 26/10/09 09:11:25 | 26/10/09 17:25:03 | 0.14 | 2.2 / 0.0

It's the great-great-grandkids that are unusual.
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
On what platform/CPU combo do you get those GGCs? Maybe it is driven by those homogeneity groups; I suspect a parent and its descendants remain in the same family.

Vaguely I remember that the deadlines are shorter on those GGCs, as otherwise the whole sequence would take way too long. With a return time of less than 48 hours, I have not been getting them for quite a few days, even though the short-deadline tasks do come through here. Today, BTW, I had a child finishing in ~9.5 hours and a few days before a parent ending in ~3.5 hours, so the variety is wide.
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline
Example: Parent workunit is set up to compute 1-20,000.

Parent Replica 0 computes 1-5,500 in 6 hours
Parent Replica 1 computes 1-5,000 in 6 hours

Validation occurs on structures 1-5,000. Structures 1-5,000 are saved. Credit is awarded to parent replicas 0 and 1 based upon the average credit per structure (thus replica 0 is awarded 11.1% more credit than replica 1).

Since child workunits are required, the back-end code determines that the most structures that should be computed by a child workunit will be those that could be computed in the 6 hour basic limit by an 'average' computer. This results in the following new workunits:

Workunit A: 5,001-8,750
Workunit B: 8,751-12,500
Workunit C: 12,501-16,250
Workunit D: 16,251-20,000

This process then repeats if necessary.

For a set of 7 batches that finished yesterday, they had the following distribution of parents, children, and so on:

8372 Parents (26.9%, cumulative: 26.9%)
18248 Children (58.7%, cumulative: 85.6%)
3943 Grandchildren (12.7%, cumulative: 98.3%)
507 Great-grandchildren (1.6%, cumulative: 99.9%)
25 Great-great-grandchildren (0.1%, cumulative: 100.0%)

While most of the children represent 'splits' (i.e. 2 or more children are created), the majority of the grandchildren are 'finishers' (i.e. only 1 additional workunit was created to finish off the workunit).

If a descendant is required, then the difference in structures between what one host computes and what the second one computes is discarded. It is important to note, though, that most workunits have no descendants. Those that do generally have only a small number of structures that are computed by one host and not the other. There is a very small percentage of work that is 'lost' due to this technique.
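The worked example above can be reproduced with a short Python sketch. This is not the actual back-end code: the function names are hypothetical, and the figure of 3,750 structures per child is simply read off the example, since the post does not say how the back end estimates what an 'average' computer can do in 6 hours.

```python
def common_end(replica_ends):
    """Last structure computed by every replica; validation covers 1..this."""
    return min(replica_ends)


def split_children(first, last, per_child):
    """Split the inclusive range [first, last] into consecutive child ranges
    of at most per_child structures each."""
    ranges = []
    start = first
    while start <= last:
        end = min(start + per_child - 1, last)
        ranges.append((start, end))
        start = end + 1
    return ranges


# Figures from the example: replicas stopped at 5,500 and 5,000 of 20,000.
replica_ends = [5500, 5000]
validated = common_end(replica_ends)            # 5000
discarded = max(replica_ends) - validated       # 500 structures not matched

# Assumed child size of 3,750 structures, taken from the example.
children = split_children(validated + 1, 20000, 3750)
for name, (lo, hi) in zip("ABCD", children):
    print(f"Workunit {name}: {lo:,}-{hi:,}")
```

Under these assumptions the sketch prints the four ranges listed in the post, and the 500 structures that only replica 0 computed are the 'lost' work mentioned at the end.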
||
|
mreuter80
Advanced Cruncher Joined: Oct 2, 2006 Post Count: 83 Status: Offline
Thanks, knreed, for the information.

It makes very good sense to me. However, I hope you do some performance matching. Otherwise you will see some results like this one:

CMD2_ 0139-2A5AA.clustersOccur-1RW6_ A.clustersOccur_ 122_ 1-- 614 Valid 10/18/09 03:50:57 10/19/09 14:56:10 12.00 181.0 / 217.7 <--- mine

Well, I know this result is very unusual, but it still makes me wonder whether roughly 10 hours went down the drain for nothing. Don't get me wrong, I believe this is a good system for handling the unpredictable running time of the WUs. I just want to mention that such odd situations exist.

[Edit 3 times, last edit by mreuter80 at Oct 26, 2009 9:47:44 PM]
||
|